A POWER  STUDY  OF  THREE  PROCEDURES 
FOR THE ASSESSMENT OF UNIDIMENSIONALITY


By 

ANNE  ELIZABETH  SERAPHINE 


A DISSERTATION  PRESENTED  TO  THE  GRADUATE  SCHOOL 
OF  THE  UNIVERSITY  OF  FLORIDA  IN  PARTIAL  FULFILLMENT 
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR  OF  PHILOSOPHY 


UNIVERSITY OF FLORIDA


ACKNOWLEDGEMENTS 

I would like to acknowledge the contributions of my
committee members toward the completion of this study. I
would like to thank Dr. M. David Miller and Dr. James J.
Algina  for  providing  insightful  technical  and  theoretical 
guidance  throughout  the  study.  I am  also  grateful  to  the 

Jin-Wen  Hsu,  and  Dr.  Peter  A.  D.  Sherrard  for  their  valuable 


ACKNOWLEDGEMENTS

ABSTRACT

APPENDIX

REFERENCES

BIOGRAPHICAL SKETCH


Abstract  of  Dissertation  Presented  to  the  Graduate  School 
of  the  University  of  Florida  in  Partial  Fulfillment  of  the 
Requirements  for  the  Degree  of  Doctor  of  Philosophy 

A POWER  STUDY  OF  THREE  PROCEDURES 
FOR THE ASSESSMENT OF UNIDIMENSIONALITY

By 

Anne  Elizabeth  Seraphine 
August,  1994 

Chair: M. David Miller
Cochair:  James  J.  Algina 

Major Department: Foundations of Education

Interest  in  the  development  and  refinement  of 
procedures  designed  to  assess  dimensionality  has  increased 
with  the  onset  of  item  response  theory  based  technologies. 
Several  promising  procedures  for  the  assessment  of 
unidimensionality  have  been  developed.  Of  these,  the  Stout 
T procedure,  the  Holland-Rosenbaum  procedure,  and  the  full- 
information  factor  analytic  procedure  were  selected  for 
investigation  in  the  present  study. 

The power of the Stout T procedure, the Holland-Rosenbaum
procedure, and the full-information factor analytic procedure
was examined under various combinations of sample size
(J = 500, 1,000, 1,500), test length (N = 25, 50), inter-trait
correlation (r = .3, .7), and distribution type (normal,
lognormal). Each of the 24 conditions was replicated 100 times.

For most factor combinations the full-information
factor  analytic  procedure  outperformed  the  other  two 
procedures,  whereas  the  Stout  T procedure  outperformed  the 

Holland-Rosenbaum procedure showed good to excellent power
under selected conditions (J = 50, r = .3) for the normal
distribution.  The  power  of  all  procedures  was  reduced  in 
the presence of skewed data. Of the three, the full-
information factor analytic procedure was the only one that
showed  adequate  power  under  the  lognormal  condition. 


CHAPTER  1 
INTRODUCTION 


Interest in the development and refinement of
procedures designed to assess dimensionality has increased
with  the  onset  of  item  response  theory  based  technologies. 
Over  80  indices  for  the  assessment  of  unidimensionality  have 
been reported in the literature (Hattie, 1985), many of
which  have  been  shown  to  be  ineffective  and/or  atheoretical 
(Hambleton  & Rovinelli,  1986;  Hattie,  1984,  1985). 

Effective  procedures  for  the  assessment  of  unidimensionality 
generally  share  common  characteristics:  They  should  be  (a) 

theoretically sound (Hattie, 1985), and (b) sensitive to

acceptable  Type  I error  rate  (Stout,  1987).  Judgments  of  a 

of specifications; that is, judgments should emphasize
relative  performance  rather  than  absolute  performance. 


unidimensionality. Based on his findings, he concluded that
the most effective approaches are based on nonlinear factor
analysis (i.e., me

McDonald, 1967); all of which involve the examination of
absolute  residuals  (i.e.,  sum  of  the  residuals  or  the  number 
of  residuals  exceeding  .01).  However,  because  such 
procedures  lack  an  objective  criterion,  they  often  render 
inconsistent  results. 

As  an  attempt  to  impose  an  objective  criterion, 
procedures  based  on  the  chi-square  statistics  associated 
with  nonlinear  factor  solutions  have  been  recommended. 
However,  many  of  these  exhibit  shortcomings.  For  instance, 
the  generalized  least  squares  (GLS)  solutions  proposed  by 
Christoffersson (1975) and Muthen (1978) are limited to data
sets of 40 items or fewer due to high computational demands
(Muthen, 1984). Because of this, the chi-square statistics
based on these solutions cannot be recommended for practical

Bock, Gibbons, and Muraki (1988) claimed the chi-square
associated  with  full-information  factor  analysis  can  be 
estimated  for  test  lengths  of  100  items  or  more.  Therefore, 
the  chi-square  based  on  the  full-information  factor  model 
can be considered a viable alternative approach for the
assessment of unidimensionality (Zwick, 1987). Thus far,
this procedure has been examined only under the
unidimensional condition (Zwick, 1987). According to Zwick
(1987),  the  performance  of  the  procedure  can  be  less  than 
perfect;  for  example,  the  chi-square  difference  often 
rejected  the  one-factor  solution  in  favor  of  a two-factor 


solution  (Zwick,  1987).  However,  the  generalizability  of 
these  findings  is  somewhat  limited,  because  only  a single 
sample size (J = 1,000) was examined and no replications
were  completed.  Thus,  it  may  be  worthwhile  to  examine  the 
performance  of  this  procedure  for  conditions  not  included  in 
Zwick's  study  (1987)  (e.g.,  different  sample  sizes,  test 
lengths,  number  of  dimensions). 

Two  approaches  based  on  nonparametric  models  also 
appear  to  be  promising.  First,  the  Rosenbaum  (1984) 
procedure,  later  known  as  the  Holland-Rosenbaum  (1986) 
procedure,  has  been  recommended  as  an  effective  approach 
(Hattie,  1985).  Zwick  (1987)  found  the  procedure  to  be 
effective  in  detecting  unidimensionality  under  the  null 
condition  for  data  sets  with  unusual  patterns  of  missing 
data. Ben-Simon and Cohen (1990), however, found the
procedure  to  be  overly  conservative  under  both 
unidimensionality  and  multidimensionality.  In  line  with 
this, the procedure was shown to be conservative in a Monte
Carlo study conducted by Nandakumar (1994). No conclusions,
however, should be drawn on the basis of these studies'
findings (i.e., studies of Ben-Simon & Cohen, 1990;
Nandakumar, 1994; Zwick, 1987); the findings of these
studies  lack  generalizability,  because  few,  if  any, 
replications  were  completed. 

Although  the  procedure  is  conservative,  the  Holland- 
Rosenbaum  procedure  shows  promise,  particularly  for  skewed 


data. The Mantel-Haenszel statistic is nonparametric and
the  procedure  appears  to  be  distribution  free  (Holland  & 
Rosenbaum,  1986).  This  may  be  an  important  attribute 
because  latent  ability  distributions  of  item  response  data 
sets may be skewed due to selection processes (De Champlain
& Tang, 1993).
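
To make the kind of statistic involved more concrete, the following is a minimal Python sketch of a Mantel-Haenszel z statistic computed over a set of 2 x 2 tables, for example one table per stratum of examinees for a given pair of items. The function names, the (a, b, c, d) table layout, and the continuity correction of 0.5 are assumptions made for this sketch; it is not presented as the implementation used in the Holland-Rosenbaum procedure or in this study.

```python
import math


def mantel_haenszel_z(tables, continuity=0.5):
    """Mantel-Haenszel z statistic for a list of 2x2 tables.

    Each table is a tuple (a, b, c, d) for one stratum:
    a = both items correct, b = first item only,
    c = second item only, d = neither item correct.
    """
    observed = expected = variance = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum with fewer than two examinees is uninformative
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        observed += a
        expected += row1 * col1 / n
        variance += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    if variance == 0.0:
        raise ValueError("no informative strata")
    diff = observed - expected
    # The continuity correction shrinks the difference toward zero.
    corrected = math.copysign(max(abs(diff) - continuity, 0.0), diff)
    return corrected / math.sqrt(variance)


def lower_tail_p(z):
    """Standard normal probability of a value at or below z."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


# Hypothetical strata: the first shows no association, the second a negative one.
z = mantel_haenszel_z([(10, 10, 10, 10), (5, 15, 15, 5)])
p = lower_tail_p(z)
```

In this sketch a strongly negative z, and hence a small lower-tail probability, indicates negative association between the two items within strata.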

Finally,  the  Stout  procedure  (1987)  was  proposed  as  an 
effective way to assess the unidimensionality of response
sets. It was originally developed to determine whether a
data set was essentially unidimensional. Stout (1987) found
that the statistic had good to moderate power in detecting
multidimensionality unless the sample size was small (J =
750) or the test length was short (N = 25). Similarly,
Nandakumar's (1987) findings indicated that the power of the
procedure  is  poor  for  test  lengths  of  25  items  or  less. 

Although  the  Stout  T statistics,  the  conservative  T 
(T^)  and  the  more  powerful  T (T,),  are  nonparametric;  the 
procedure  may  be  distribution  dependent.  This  contention 
has  been  questioned,  however.  Although  De  Champlain  and 
Tang (1993) found that the Type I error rates of the
statistic were inflated for nonnormal latent variables,
Nandakumar (1994) found the Type I error rate remained at
.05  for  moderately  skewed  data. 

It  appears  several  promising  procedures  for  the 
assessment  of  unidimensionality  have  been  developed.  A 
parametric  procedure  is  the  full-information  factor  analytic 


(FIFA) procedure (Bock, Gibbons, & Muraki, 1988; Zwick,
1987). The nonparametric procedures are the Holland-
Rosenbaum (HR) procedure (Holland & Rosenbaum, 1986;
Rosenbaum, 1984) and the Stout T (ST) procedure (Stout,
1987, 1990).


The  Problem 

Although a number of Monte Carlo studies have been
conducted  to  examine  the  effectiveness  of  the  individual 
approaches  to  the  assessment  of  unidimensionality,  no  study 
has yet been conducted to examine the relative power of the


procedures  (Hambleton  & Rovinelli,  1986;  Hattie,  1984; 

Zwick,  1987)  or  failed  to  include  the  procedures  of  interest 
(De Champlain & Tang, 1993; Nandakumar, 1994). The
comparative  studies  that  examined  one  or  more  of  the  three 
procedures  were  beset  by  serious  limitations,  such  as  a lack 
of replications or unrealistic data conditions (Ben-Simon &
Cohen,  1990;  Gessaroli  & De  Champlain,  1992;  Hambleton  & 
Rovinelli,  1986;  Hattie,  1985;  Nandakumar,  1991).  Moreover, 


description of


providing  an  indication  of  how  closely  the  observed  Type  I 
error rate of each procedure adheres to the nominal Type I
error rate. Less work, however, has been done under the
two-dimensional condition. In fact, several of these studies
had only a few, if any, replications, which precludes them
from being considered power studies (Ben-Simon & Cohen,
1990; Nandakumar, 1994; Zwick, 1987).
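
The dependence of a power estimate on the number of replications can be sketched as follows. Empirical power is taken here as the proportion of replications in which the null hypothesis of unidimensionality is rejected, with a binomial standard error attached. The function names, the .05 level, and the p-values below are hypothetical choices for illustration and are not results from any of the studies cited.

```python
import math


def rejection_rate(p_values, alpha=0.05):
    """Proportion of replications in which the null hypothesis is rejected."""
    if not p_values:
        raise ValueError("at least one replication is required")
    rejections = sum(1 for p in p_values if p < alpha)
    return rejections / len(p_values)


def standard_error(rate, replications):
    """Binomial standard error of an estimated rejection rate."""
    return math.sqrt(rate * (1.0 - rate) / replications)


# Hypothetical outcome: 80 rejections in 100 replications of one condition.
p_values = [0.01] * 80 + [0.20] * 20
power = rejection_rate(p_values)
print(power)                                           # 0.8
print(round(standard_error(power, len(p_values)), 3))  # 0.04
```

With only a handful of replications the same calculation yields a standard error too large for the rejection rate to be read as a power estimate. The same function applied to unidimensional data gives an empirical Type I error rate.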

Even  so,  the  findings  of  earlier  studies  have  suggested 
that  the  procedures  are,  for  the  most  part,  sensitive  to 
departures  from  unidimensionality  for  specific  combinations 
of sample size, test length, and inter-trait correlation.
Moreover,  it  is  suspected  that  the  power  of  these  procedures 
may  be  influenced  by  distributional  characteristics,  because 
either  the  procedure  itself  is  distribution  dependent  or  the 
statistic  is  parametric. 

Purpose of the Study

The  primary  intent  of  this  study  is  to  examine  the 
effects  of  sample  size  (J  = 500,  1,000,  1,500),  test  length 
(N = 25, 50), distributional characteristics (DT = normal,
lognormal),  and  their  interactions  on  the  power  of  the 
selected  procedures  for  two-dimensional  data  in  the  presence 
of  low  and  moderately  high  inter-trait  correlations  (r  = 

0.3,  0.7).  A secondary  aim  is  to  examine  the  power  of  each 
procedure  across  identical  conditions  to  facilitate 
comparisons  of  their  relative  effectiveness. 
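
The fully crossed design described above can be enumerated in a few lines of Python. The factor levels are those stated in the text and the 100 replications per condition are taken from the abstract; the variable and function names are illustrative assumptions only.

```python
from itertools import product

# Factor levels as stated in the text; the names are illustrative.
SAMPLE_SIZES = (500, 1000, 1500)         # J
TEST_LENGTHS = (25, 50)                  # N
DISTRIBUTIONS = ("normal", "lognormal")  # DT
CORRELATIONS = (0.3, 0.7)                # r
REPLICATIONS = 100                       # per condition, as reported in the abstract


def design_conditions():
    """Return every combination of the four factors as a list of dicts."""
    return [
        {"J": j, "N": n, "DT": dt, "r": r}
        for j, n, dt, r in product(
            SAMPLE_SIZES, TEST_LENGTHS, DISTRIBUTIONS, CORRELATIONS
        )
    ]


conditions = design_conditions()
print(len(conditions))                 # 24
print(len(conditions) * REPLICATIONS)  # 2400
```

Crossing three sample sizes with two test lengths, two distribution types, and two correlations gives 3 x 2 x 2 x 2 = 24 conditions.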


Significance of the Study


Item  calibration  and  score  interpretation  are  at  the 
heart  of  all  psychometric  applications.  One  of  the  leading 
procedures  for  item  calibration  has  been  item  response 
theory;  it  is  currently  being  applied  to  a number  of 
practical  measurement  problems  such  as  test  equating,  test 
construction,  differential  item  functioning,  and 
computerized  adaptive  testing.  Yet,  the  procedure  is  based 
on  the  "fragile"  (Traub,  1983)  assumption  of 
unidimensionality (Hambleton & Swaminathan, 1985).
Unidimensionality is untenable for most item response sets,
because  test  scores  are  influenced  by  a host  of  variables 
common  to  test  settings — educational  training,  test 
speededness,  and  examinees'  propensities.  As  a result, 
assessing  the  unidimensionality  of  a data  set  is  one  of  the 
most fundamental problems facing psychometricians (Goldstein,
1980;  Traub,  1983). 

To  address  this  problem,  the  aforementioned  procedures 
for  the  assessment  of  unidimensionality  have  been  proposed. 
Yet,  the  success  of  any  given  procedure  depends  to  a large 
extent  on  the  appropriateness  of  the  procedure  for  a 
particular  application.  The  first  step  in  judging  the 
appropriateness  of  a procedure  is  to  understand  its 
performance  in  the  presence  of  various  factor  combinations 
for  both  unidimensional  and  multidimensional  data  sets. 


distributional  characteristics  on  the  power  of  each 
procedure. Such information can then be used to guide the
selection  and  use  of  the  appropriate  procedure.  In 
addition,  the  findings  of  the  study  should  provide  guidance 
to researchers in the quest to understand and refine
existing  indices,  and  perhaps  spur  efforts  in  the 
development  and  refinement  of  new  and  better  indices. 


CHAPTER  2 

REVIEW  OF  THE  LITERATURE 

Central to all psychometric applications are item
calibration and score interpretation. During the past
decade  item  response  theory  (IRT)  has  been  used  extensively 
for  item  calibration  with  applications  such  as  test 
equating, test construction, differential item functioning,
and  computerized  adaptive  testing  (Linn,  1989).  However, 
item  response  theory  produces  accurate  calibrations  only 
when  certain  conditions  are  fulfilled.  It  appears  that  the 
most  fundamental  of  these  may  be  the  assumption  of 
unidimensionality (Drasgow & Parsons, 1983; Hambleton &
Swaminathan, 1985; Reckase, 1979; Stout, 1987; Traub, 1983).

Violations  of  Unidimensionality 

The robustness of unidimensional item response models
to  violations  of  unidimensionality  is  dependent  on  the 


1983, 


198) . 



empirical findings support this claim (Drasgow & Parsons,
1983; Harrison, 1986; Reckase, 1979).

Furthermore,  the  IRT  model  will  often  provide  an 
adequate  description  of  a data  set  comprised  of  multiple 
"psychological"  traits,  resulting  in  a single  "statistical" 
trait as described by Reckase (1990). For example, Reckase,
Ackerman, and Carlson (1988) provided evidence that a
unidimensional test could be built from a multidimensional
set of items that each measure an equally weighted composite
of abilities. Through her theoretical work, Wang (1986)
showed that unidimensional analysis of multidimensional data
will  result  in  estimates  of  ability  as  some  weighted 
composite  or  "reference  composite"  of  the  actual  underlying 
abilities.  This  relationship  is  based  on  the  assumption 
that  the  structure  of  the  data  conforms  to  the  two-parameter 
compensatory  multidimensional  logistic  item  response  model 
(Reckase  & McKinley,  1983),  and  that  the  existence  of  a 
reference  composite  is  not  a function  of  the  prepotency  of 
some  general  factor. 
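
For readers less familiar with the compensatory model referred to above, a minimal Python sketch of a two-parameter compensatory multidimensional logistic item response function follows. The slope-intercept parameterization (discriminations a and intercept d), the function name, and the example values are assumptions made for illustration; the cited authors' exact parameterization may differ.

```python
import math


def m2pl_probability(theta, a, d):
    """Probability of a correct response under a compensatory
    two-parameter multidimensional logistic model.

    theta: sequence of latent trait values, one per dimension
    a:     sequence of item discriminations, one per dimension
    d:     item intercept (related to item difficulty)
    """
    if len(theta) != len(a):
        raise ValueError("theta and a must have the same length")
    logit = sum(a_k * t_k for a_k, t_k in zip(a, theta)) + d
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical item measuring an equally weighted composite of two traits.
a = (1.0, 1.0)
d = 0.0
print(m2pl_probability((1.0, -1.0), a, d))  # 0.5
print(m2pl_probability((0.0, 0.0), a, d))   # 0.5
```

The two examinees have the same probability of success because, in a compensatory model, a high standing on one trait offsets a low standing on the other. Only the weighted composite of the traits enters the response function.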

Not  all  multidimensional  data  sets  exhibit  the 
structure to uphold the assumption of unidimensionality.

That  is,  the  dominant  trait  may  not  "be  sufficiently 
prepotent"  or  the  weighted  composite  of  abilities  may  vary 
across  items.  A number  of  research  findings  have  indicated 
that  under  these  conditions  the  use  of  the  unidimensional 
item response model will lead to biased parameter estimates


(Ackerman, 1989; Drasgow


1983; 


1988; 


Reckase,  1979). 

For instance, Reckase (1979) found that when there is
no dominant factor, LOGIST (Wood, Wingersky, & Lord, 1976) is

necessarily  the  trait  on  which  the  raw  scores  are  based. 
Related  to  this,  Drasgow  and  Parsons  (1983)  found  that  as 
the  prepotency  of  the  general  factor  decreases,  or  rather  as 
factor  intercorrelations  drop  to  .39  and  below,  LOGIST  is 
drawn  to  a group  factor  for  ability  estimates  and  for  many 
items  overestimates  the  b parameter.  Other  studies  support 
these  findings  (Ackerman,  1989;  Harrison,  1986). 

In  this  case,  multidimensional  item  response  models  may 
provide  more  accurate  estimates  of  person  and  item 
parameters  than  unidimensional  models.  Thus  far,  estimation 

multidimensional logistic model or MIRT (Reckase & McKinley,
1983).  Unfortunately,  a number  of  unresolved  problems  are 
associated  with  the  estimation  and  use  of  the 
multidimensional model. One seemingly insurmountable
problem  is  rotational  indeterminacy;  this,  alone,  has 

most psychometric applications. With a reliance on
unidimensional item response theory comes the necessity to


develop, test, and refine the technology to
violations of the fundamental assumption of
unidimensionality. A number of procedures to assess
unidimensionality have been proposed in the literature, many
of which are described later in this chapter.

Conceptualizations of Unidimensionality

Presented  first  are  definitions  of  unidimensionality, 
followed  by  a review  of  various  procedures.  A rigorous 
definition  of  unidimensionality  should  serve  as  a benchmark 
for judging the adequacy of each procedure's performance.
However,  there  is  little  agreement  on  what  is  meant  by  the 
term  unidimensionality  (McDonald,  1991).  As  a result, 
researchers  often  face  the  seemingly  insuperable  task  of 
trying  to  differentiate  between  what  is  meant  by  the  two 
terms,  unidimensionality  and  multidimensionality. 

Reckase  (1990)  proposed  a framework  that  helps  clarify 
the distinction between the two terms. One source of
confusion may be the failure to recognize the existence of
two  types  of  dimensionality:  "psychological  dimensionality" 

and  "statistical  dimensionality."  Psychological 
dimensionality  is  the  number  of  underlying  constructs  or 
traits  required  to  perform  successfully  on  some  measure, 
whereas  statistical  dimensionality  is  "the  minimum  number  of 
mathematical  variables  that  is  needed  to  summarize  a matrix 
of  item  response  data"  (Reckase,  1990,  p.  1). 


structure  principle  always  holds  for  some  single  latent 
variable,  it  is  the  regression  principle  that  provides 


testable consequences.

The  "regression  principle"  refers  to  model  assumptions 
used  to  explain  the  dependency  structures  of  the  data 
matrix,  whereas  the  "conditional  structure  principle"  refers 
to  the  conditional  distribution  of  the  manifest  variable 
given  some  latent  variable  (McDonald,  1979).  Descriptions 
of  the  regression  principle  and  the  conditional  structure 
principle  are  presented  in  the  following  paragraphs. 

The  regression  principle 

Latent  functions  can  assume  a number  of  regression 
forms. McDonald (1982) provided a general framework for the
classification  of  latent  trait  models:  (a)  models  with 

linear  coefficients  and  linear  latent  traits;  (b)  models 
with  linear  coefficients  and  nonlinear  latent  traits;  and 
(c)  models  with  nonlinear  coefficients  and  nonlinear  latent 
traits.  Because  some  have  advocated  the  use  of 
nonparametric  weak  monotonic  functions  to  model  latent 
traits  (Holland  & Rosenbaum,  1986;  Rosenbaum,  1984;  Stout, 
1987, 1990), McDonald's classification system is extended to
include  a fourth  category:  nonparametric  monotonic  models. 

Each of the four function forms is briefly described.

The  first  type  of  model  (i.e.,  linear  coefficients  and 
linear  latent  traits)  consists  of  a linear  factor  model 
based  on  phi  coefficients.  Definitions  of  unidimensionality 


linear  me 


(McDonald,  1981,  p.  101)  are  usually  inappropriate  for 
binary  data  for  several  reasons.  First  of  all,  if  a 
strictly  linear  model  is  applied  to  data  with  the  nonlinear 
structure  usually  exhibited  by  binary  data,  the  factor 
structure is likely to be comprised of both content (i.e.,
the  latent  traits)  and  spurious  factors  (i.e.,  the  factors 
that  result  from  curvature,  a function  of  item  means).  This 
will  hold  unless  the  "regressions  of  the  items  on  the  latent 
traits are linear" (McDonald & Ahlawat, 1974, p. 85), which
is  unlikely. 

According  to  Mislevy  (1986),  assuming  a linear 
relationship  between  the  latent  variable  and  binary  item 
responses  is  problematic  "because  the  value  of  a dichotomous 
variable  is  bounded,  implying  that  its  regression  on  any 

(p.  9).  In  other  words,  a nonlinear  model  is  the  outcome  of 
regressing  a bounded  binary  variable  on  a continuous  latent 

As  a result,  for  binary  data  sets  it  is  more 
appropriate to define unidimensionality in terms of a

models,  subsumes  those  with  linear  coefficients  and 


In general we may write

$$h(\mathbf{x}) = \int h(\mathbf{x} \mid \theta)\, g(\theta)\, d\theta. \qquad (1)$$

The basic postulate of latent structure analysis in its
general form is that

$$h(\mathbf{x} \mid \theta) = \prod_{j=1}^{n} h_j(x_j \mid \theta), \qquad (2)$$

whence, substituting (2) into (1),

$$h(\mathbf{x}) = \int \prod_{j=1}^{n} h_j(x_j \mid \theta)\, g(\theta)\, d\theta \qquad (3)$$

(McDonald, 1981, p. 115). McDonald (1962) showed that the
strong form of local independence is the underlying
assumption of common latent trait models.

Thus, for strict unidimensionality to hold, the strong form
of  conditional  independence  must  hold  for  dependency 
structures  with  a single  underlying  trait  (Holland  & 
Rosenbaum,  1986;  McDonald,  1981).  McDonald  contended  the 
essence  of  unidimensionality  changes  irrevocably  when  based 
on  any  other  form  of  the  structural  conditional  principle. 
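
The strong form of local independence with a single trait can be illustrated numerically. Under it, the marginal probability of a response pattern is the integral, over the latent trait, of the product of the item-level conditional probabilities weighted by the trait density. The Python sketch below approximates that integral with the midpoint rule. The two-parameter logistic item response function, the standard normal trait density, the integration limits, and all names are assumptions chosen for illustration, not part of McDonald's formulation.

```python
import math


def item_prob(theta, a, b):
    """Two-parameter logistic item response function (illustrative choice)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def normal_density(theta):
    """Standard normal density, used here as the latent trait distribution."""
    return math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)


def pattern_probability(pattern, items, lo=-6.0, hi=6.0, steps=1200):
    """Marginal probability of a 0/1 response pattern under local independence.

    pattern: sequence of 0/1 item scores
    items:   sequence of (a, b) parameter pairs, one per item
    The integral over theta is approximated with the midpoint rule.
    """
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        theta = lo + (k + 0.5) * width
        conditional = 1.0
        for x, (a, b) in zip(pattern, items):
            p = item_prob(theta, a, b)
            # Local independence: item probabilities multiply given theta.
            conditional *= p if x == 1 else 1.0 - p
        total += conditional * normal_density(theta) * width
    return total


# Hypothetical three-item test.
items = [(1.0, -0.5), (1.5, 0.0), (0.8, 1.0)]
prob_all_correct = pattern_probability((1, 1, 1), items)
```

Summing `pattern_probability` over all eight response patterns of the three hypothetical items returns a value very close to 1, which is a quick check that the product form defines a proper distribution.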

Although  strict  unidimensionality  is  defined  in  terms 
of  local  independence,  it  is  commonly  assessed  according  to 
the weak principle of unidimensionality. The weak principle
of conditional independence holds when item scores for
fixed  latent  traits  do  not  covary  (McDonald,  1979). 

McDonald  (1981)  explained  that  this  assumption  often 
accompanies  common  factor  analysis,  "because  under  the 



assumptions  of  multivariate  normality,  the  weak  form  of  the 
principle  implies  the  strong  form,  as  well  as  conversely" 

(p.  116).  This  can  be  explained  in  terms  of  the  third  and 
fourth  moments  of  the  multivariate  normal  distribution:  the 

third  moment  is  null,  whereas  the  fourth  moment  is  a 
function  of  the  first  two  moments  (McDonald,  1981). 

Stout  (1987,  1990)  questioned  whether  conditional 

Based  on  the  assumption  that  the  complete  latent  space  is 
best  defined  in  terms  of  a single  dominant  trait  and  one  or 
more nuisance traits, Stout posited essential independence
as  more  realistic  than  traditional  assumptions.  Stout 
(1990,  p.  297)  provided  the  following  formal  definition  of 
the assumption: "The latent model $(U_N, \Theta, N \geq 1)$ is said to
be essentially independent (EI) if for every collection of
nonsparse subtests for each $\theta$ in the range of $\Theta$,

$$E\,\mathrm{cov}(U_i, U_j \mid \Theta = \theta) \to 0$$"

In  other  words,  as  the  number  of  items  approaches  infinity, 
expected item covariances conditioned on some latent trait
are small in magnitude across all theta (Nandakumar, 1991).
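
The limiting idea can be illustrated with a small numerical sketch in Python. In the hypothetical test below, every item depends on a dominant trait, and a fixed number of items also depend on an independent, standard normal nuisance trait. For a fixed value of the dominant trait, the sketch averages the absolute conditional covariance over all item pairs and shows that the average shrinks as the test grows. The logistic response function, the loadings, the quadrature grid, and all names are assumptions for illustration; this is not Stout's statistic.

```python
import math
from itertools import combinations


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def nuisance_grid(steps=200, lo=-6.0, hi=6.0):
    """Midpoint grid and normalized standard normal weights for the nuisance trait."""
    width = (hi - lo) / steps
    nodes = [lo + (k + 0.5) * width for k in range(steps)]
    raw = [math.exp(-0.5 * e * e) * width for e in nodes]
    total = sum(raw)
    return nodes, [w / total for w in raw]


def mean_abs_conditional_cov(n_items, n_nuisance, theta=0.0, nuisance_loading=1.0):
    """Average |cov(U_i, U_j | theta)| over all item pairs.

    Hypothetical test: every item depends on theta; only the first
    n_nuisance items also depend on the nuisance trait.
    """
    nodes, weights = nuisance_grid()
    loadings = [nuisance_loading if i < n_nuisance else 0.0 for i in range(n_items)]
    # probs[i][k] is P(U_i = 1 | theta, nuisance = nodes[k]).
    probs = [[logistic(theta + lam * e) for e in nodes] for lam in loadings]
    means = [sum(w * p for w, p in zip(weights, row)) for row in probs]
    total = 0.0
    for i, j in combinations(range(n_items), 2):
        joint = sum(w * pi * pj for w, pi, pj in zip(weights, probs[i], probs[j]))
        total += abs(joint - means[i] * means[j])
    return total / math.comb(n_items, 2)


# Five nuisance-affected items embedded in tests of increasing length.
for n in (10, 20, 40):
    print(n, mean_abs_conditional_cov(n, n_nuisance=5))
```

Because only the pairs formed by the five nuisance-affected items covary once the dominant trait is fixed, the average falls roughly in proportion to the number of item pairs as the test lengthens.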


less  restrictive  than  conditional  independence.  Hence,  it 
leads  to  a form  of  unidimensionality  somewhat  different  from 
that  of  traditional  unidimensionality.  Stout  (1987)  termed 
this less restrictive form essential unidimensionality.


tradltic 


dimensionality 



facilitate  an  understanding  of  the  interconnection  between 

Essential unidimensionality encompasses traditional
unidimensionality: if a data set is strictly
unidimensional, it will also be essentially unidimensional.
But the converse is not true. That is, provided that
$\mathrm{cov}(U_i, U_j \mid \Theta = \theta) \neq 0$ for all i and j, an essentially
unidimensional data set will be considered to be
multidimensional instead of unidimensional in the strictest
sense.

Traditional unidimensionality holds for data sets when
a single latent trait accounts for dependency structures
under conditional independence. As mentioned previously,
either a parametric or a nonparametric item response
function can be used. McDonald (1981) contended that for
unidimensionality to hold, a one factor nonlinear function
must conform to the data. In contrast, Holland and
Rosenbaum (1986) suggested it is sufficient for a
nonparametric weak monotonic model with a single trait to
fit the data. If conditional independence does not hold for
either single trait model, one can assume the data set is
multidimensional.

However, strict enforcement of conditional independence
excludes  data  sets  with  a single  dominant  trait  and  one  or 
more  minor  traits  from  being  classified  as  unidimensional. 


This  is  considered  problematic  by  some. 


Such  dependency 



a single  trait,  but  are  also  preferable.  Humphreys  (1985) 
argued that tests which elicit strictly unidimensional item
response  sets  often  engender  overly  narrow  interpretations 
of  examinees'  performances. 

In  line  with  this  reasoning,  Goldstein  (1980) 
questioned the practice of defining unidimensionality in
terms  of  conditional  independence.  Goldstein  (1980) 
delineated the problem in terms of the unique vectors'
behavior. It is well known that local independence
presupposes  the  orthogonality  of  the  unique  vectors 
(Goldstein,  1980;  Harman,  1976;  McDonald,  1981).  However, 
according  to  Goldstein  (1980),  the  assumption  of 
orthogonality  rarely  holds  for  most  item  response  data  sets. 
In  the  absence  of  orthogonality,  clusters  of  unique  vectors 
combine  to  form  one  or  more  "nuisance  factors." 

Nuisance  factors  are  usually  functions  of  the  testing 
environment,  instructional  effects,  or  characteristics  of 
the  instrument.  Because  of  the  pervasiveness  of  these 
factors,  Goldstein  (1980)  suggested  that  the  definition  of 
unidimensionality should be extended to include data sets
with  a single  dominant  trait  and  several  nuisance  traits. 
Likewise,  Stout  (1987)  contended  that  local  independence  is 
rarely satisfied for most data sets. He formulated essential

essential independence comes essential unidimensionality.




Essential unidimensionality assumes the existence of a
single  dominant  trait  in  the  presence  of  minor  traits  or 
nuisance  factors.  Stout  (1990)  provided  a formal  definition 
of essential dimensionality:

The essential dimensionality $d_E$ of a test $\{U_i, i \geq 1\}$
is the minimal dimensionality required for a latent
trait $\Theta$ to make the latent trait model an EI
[essentially independent], WM [weakly monotone] model.

hold. If essential $d_E$ dimensionality holds using
ability $\Theta$ then $\{U_i, i \geq 1\}$ is said to be essentially
$d_E$ dimensional with respect to ability $\Theta$. Such a
trait is called an essential trait for $\{U_i, i \geq 1\}$. (p.

Essential unidimensionality is based on a weak monotonic

independence  is  based  on  either  a weak  monotonic  model  or  a 
nonlinear latent model and strict local independence.

In the end, essential unidimensionality remains a
theoretical conceptualization. That is, departures from
essential unidimensionality can only be detected with the
Stout (1987) procedure. As will be shown later, the Stout
procedure is derived from the theory of essential
unidimensionality. Because the definition provides no
independent criteria for testing whether a data set departs
from essential unidimensionality, it is impossible to know
the  point  at  which  even  simulated  data  are  essentially 
unidimensional  or  essentially  multidimensional. 

The nature of essential independence is the source of

condition presumes that the number of items approaches
infinity. Thus, the condition is stated as a mathematical
limit: the expected value of the conditional covariances on
the single trait approaches zero as the number of items
approaches infinity. A limit provides no practical guidelines
for the point at which a data set stops exhibiting
unidimensionality and starts exhibiting
multidimensionality. In other words, the question becomes

before they are no longer considered to be multidimensional.
As a result, essential unidimensionality is difficult
to operationalize in simulation work, because no one is sure
what defines such a data set. So, in this review of
procedures for unidimensionality, the dichotomy
between traditional unidimensionality and traditional
multidimensionality will be of interest.

unidimensional if conditional independence is satisfied
for a monotonic

ability. On the other hand, the same set of responses is
considered to be essentially unidimensional if the
assumption of essential independence is satisfied.

be multidimensional.

A number of indices have been recommended by
researchers to assess the unidimensionality of scales and
subscales. Over 80 indices were identified by Hattie (1985)
in  an  extensive  review  of  procedures.  However,  only  a few 
were  considered  to  be  promising.  The  majority  were 
considered  to  be  undesirable  because  they  were  atheoretical 
(Hattie, 1985). Of these, even fewer were shown to be
effective when applied to simulated data (Hattie, 1984).

Since Hattie's (1985) review, a myriad of indices have
continued  to  be  reported  in  the  literature,  including  those 
recommended  by  Hattie  (1984).  Effective  procedures  are 
based  on  sound  theory  (Hattie,  1984,  1985)  and  are  sensitive 

acceptable  Type  I error  rate  (Stout,  1987).  Only  procedures 
that  met  these  criteria  were  included  in  the  review  and 

characteristics to a greater extent than others. That is, the
reported  procedures  differ  in  degree  of  theoretical  rigor, 
adherence  to  the  nominal  Type  I error  rate,  and  power.  The 

extent  of  these  differences. 

sections:  a description  of  procedures  and  a review  of  the 

empirical  literature.  The  organizational  framework  within 
each  section  includes  a distinction  between  parametric  and 

parametric if the mathematical derivation of its associated
statistics is based on assumptions regarding the
distribution of underlying traits or the shape of the
characteristic curve linking observed item responses and
latent traits (Nandakumar & Yu, 1994).


A number  of  procedures  are  based  on  parametric  models 
or  models  with  a specified  functional  form.  Included  in 
this  section  are  procedures  based  on  restricted  factor 
analysis,  polynomial  models,  and  full-information  factor 
analysis.  Of  these,  only  the  latter  is  included  for 
investigation  in  the  present  study. 

Mislevy (1986) presented restricted factor models in
his review of latent trait models for binary data.
Restricted factor approaches embrace the following: (a) the
unweighted least squares solution (ULS), which operates on
one- and two-way margins (or sample tetrachoric
correlations) of the $2^n$ raw data, where n denotes the number
of measured variables or items; (b) the original
generalized least squares solution (GLS) (Christoffersson,
1975); and (c) the modified generalized least squares
solution, later known as weighted least squares (WLS)
(Muthen, 1978, 1984). Both the GLS and WLS solutions


and  two-way  margins  of  the 


approach,  which  ope: 


Unlike  the 


ates  on  phi 
on  the  joint 


The ULS, GLS, and WLS solutions are based on
multivariate normal ogive functions, and are described as
having "two tiers of latent variables" (Bartholomew, 1980,
p. 311). The model on which these approaches are based
can be derived from the following relationships:

1. Assume m latent variables $\theta$, and p > m binary
observed variables $x = (x_1, \ldots, x_p)$. Furthermore, suppose
a set of p latent variables y underlie the dichotomous
observed responses.

2. The dichotomous variable $x_j$ is related to $y_j$ in the
following way:

$$x_j = \begin{cases} 1 & \text{if } y_j \ge \gamma_j \\ 0 & \text{if } y_j < \gamma_j \end{cases}$$

where $\gamma_j$ denotes the "threshold" parameter.

3. The relationship between the two latent variables
is $y_j = \lambda_{j1}\theta_1 + \ldots + \lambda_{jm}\theta_m + v_j$, where $v_j$ is a residual
distributed as $N(0, \sigma_j^2)$ and independent over examinees
and items. The end result is a "multivariate generalization
of the two-parameter normal item response" (Mislevy, 1986,
p. 10) model, which is reported by Lawley (1943) and Lord
(1952) and expressed as


where $P(x_{ij} = 1 \mid \theta_i)$ denotes the probability of a correct
response from examinee i to item j given examinee ability
$\theta_i$, $\theta$ has a MVN(0, $\Phi$) distribution, y has a marginal
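
As an illustration only, the three relationships above can be sketched in Python. The sketch assumes that the response probability takes the usual normal ogive form implied by those relationships, that is, the standard normal distribution function evaluated at (sum of loading times trait, minus threshold) divided by the residual standard deviation. The function names, the use of a residual standard deviation argument, and the example values are assumptions introduced here and do not come from the source.

```python
import math
import random


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_correct(theta, loadings, threshold, resid_sd):
    """P(x_j = 1 | theta) = P(y_j >= threshold), with y_j = sum(loading * theta) + v_j."""
    mean_y = sum(l * t for l, t in zip(loadings, theta))
    return normal_cdf((mean_y - threshold) / resid_sd)


def simulate_response(theta, loadings, threshold, resid_sd, rng):
    """Draw the latent response y_j, then dichotomize it at the threshold."""
    y = sum(l * t for l, t in zip(loadings, theta)) + rng.gauss(0.0, resid_sd)
    return 1 if y >= threshold else 0


if __name__ == "__main__":
    rng = random.Random(0)
    theta = [1.0, 0.0]          # hypothetical examinee with two traits
    loadings = [0.6, 0.3]       # hypothetical item loadings
    p = prob_correct(theta, loadings, threshold=0.0, resid_sd=0.8)
    draws = [simulate_response(theta, loadings, 0.0, 0.8, rng) for _ in range(20000)]
    print(round(p, 4), round(sum(draws) / len(draws), 4))
```

The simulated proportion of correct responses should be close to the analytic probability, which is a quick check that the dichotomization step and the ogive expression describe the same model.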




Items  are  approximately  unidimensional. 

, both  methods  should  render  similar  results,  uhen 


An  alt« 


based  on  unweighted  least 


sed  by  one  of  two  methods,  "the 


ilysis  of 


sum of squared residuals, or sum of the absolute values of
the residuals after removing one" (Hattie, 1985, p. 146)
factor. This approach is also used with the models
described in the following paragraphs.

Approaches  based  on  GLS  and  WLS  solutions 

The GLS solution for the estimation of the factor
analytic model was first developed by Christoffersson
(1975); Muthen (1978) later modified the solution to

Both  GLS  solutions  are  an  improvement  over  those  associated 
with  the  ULS  solution. 

The generalized least squares (GLS) solution proposed
by Christoffersson (1975, p. 8) minimizes the following
function:

$$F_{GLS} = \tfrac{1}{2}\,\mathrm{tr}\left[(S - \Sigma)W^{-1}\right]^{2}$$


Kaplan (1985) it


$$F_{WLS} = (s - \sigma)'\,W^{-1}(s - \sigma)$$

where s is a vector of .5[p(p + 1)] distinct elements, where
p denotes the number of elements of S, $\sigma$ is the
corresponding same-order vector of $\Sigma$, and W is a consistent
estimate of the asymptotic covariance matrix of s as
proposed by Browne (1982).
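
A minimal sketch of the weighted least squares discrepancy, assuming it is the quadratic form (s - sigma)' W^{-1} (s - sigma) described in the text. The function names and the small linear solver are illustrative assumptions; no estimation of the factor model itself is attempted, and the weight matrix is taken as given.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def wls_discrepancy(s, sigma, w):
    """Return (s - sigma)' W^{-1} (s - sigma) for sample moments s,
    model-implied moments sigma, and weight matrix w."""
    d = [si - gi for si, gi in zip(s, sigma)]
    w_inv_d = solve(w, d)  # avoids forming the inverse explicitly
    return sum(di * xi for di, xi in zip(d, w_inv_d))


if __name__ == "__main__":
    s = [1.0, 2.0]                      # hypothetical sample moments
    sigma = [0.0, 1.0]                  # hypothetical model-implied moments
    w = [[2.0, 1.0], [1.0, 2.0]]        # hypothetical weight matrix
    print(wls_discrepancy(s, sigma, w))
```

Solving the linear system rather than inverting W is a design choice made here for numerical convenience; it yields the same quadratic form.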

Both  solutions  share  the  same  advantages  over  the  more 
traditional  ULS  solution.  One  of  the  primary  advantages  of 
GLS  and  WLS  solutions  is  that  an  asymptotic  chi-square 

functions.  The  fit  statistic  associated  with  either 
solution has a chi-square distribution with .5p(p + 1) - L
degrees of freedom, where L is the number of free parameters
associated with the model (Christoffersson, 1975; Muthen,
1978, 1984).
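
The degrees of freedom stated above can be computed as follows. This is a small sketch that reads the garbled expression as .5p(p + 1) - L; the function name and the example parameter count (one loading and one threshold per item for a one-factor model) are assumptions made for illustration.

```python
def fit_statistic_df(p, n_free):
    """Degrees of freedom .5 * p * (p + 1) - L for p items and L free parameters."""
    n_moments = p * (p + 1) // 2
    df = n_moments - n_free
    if df < 0:
        raise ValueError("more free parameters than available moments")
    return df


if __name__ == "__main__":
    # Hypothetical one-factor model for 10 items: 10 loadings + 10 thresholds.
    print(fit_statistic_df(10, 20))
```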

Procedures  based  on  polynomial  models 

Hattie  (1985)  not  only  recommended  the  use  of 

(1975) and Muthen (1978), but also recommended approaches
based on McDonald's (1967) model, which is an "approximation

description  of  this  model. 

McDonald  (1967)  claimed  that  the  appropriate  regression 
function  for  binary  data  is  a s 


single 


slynomial 


Model parameters are estimated through an unweighted least
squares solution (McDonald, 1967). The model is

where g denotes the degree of the polynomial in the latent
trait (i.e., linear, quadratic, cubic, etc.), $\theta$ denotes the
latent trait, and $P_i(\theta)$ represents the probability given $\theta$ that
an examinee, j, provides a correct response to the ith
binary item (McDonald & Ahlawat, 1974).

McDonald  (1967)  showed  that  the  single  trait  polynomial 
model  can  be  used  to  approximate  the  normal  ogive  model. 
Several  post  hoc  procedures  have  been  developed  to  assess 
unidimensionality  on  the  basis  of  the  polynomial  model.  The 
procedures outlined in this section are based on the two
structural principles delineated in the preceding
discussion. The form of the regression is a polynomial
series model with second and third order terms. The
principle of weak independence serves as a proxy for the
more rigorous conditional structure principle, local
independence (McDonald, 1967, 1979, 1981).

As  alluded  to  earlier,  several  traditional  indices,  all 
based  on  the  dispersion  matrix  for  a nonlinear  model,  have 
been  proposed.  Traditional  indices  are  the  sum  of  absolute 
residual  correlations  (Hattie,  1985);  the  average  of 
absolute-valued residual correlations (Hambleton &
Rovinelli, 1986); and the means and standard deviations of
squared residual correlations (Nandakumar, 1991). These
indices are all attempts to summarize the off-diagonal of
the dispersion matrix, either through the sum of the
absolute values of the residuals or sum of the squared
residuals. The rationale behind these indices is that weak
independence  is  satisfied  if  the  model  conforms  to  the  data. 
Therefore,  if  a single  factor  model  fits  the  data,  the 
residual  covariances  should  be  zero  or  very  small. 
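
The residual summaries listed above can be sketched as follows. The sketch assumes a symmetric residual correlation matrix is already available from a fitted one-factor model; the function name, the dictionary keys, and the use of the population form of the standard deviation are choices made here for illustration and are not taken from the cited papers.

```python
import math


def residual_indices(resid):
    """Summarize the off-diagonal elements of a symmetric residual matrix."""
    n = len(resid)
    off = [resid[i][j] for i in range(n) for j in range(i + 1, n)]
    abs_vals = [abs(r) for r in off]
    sq_vals = [r * r for r in off]
    mean_sq = sum(sq_vals) / len(sq_vals)
    sd_sq = math.sqrt(sum((v - mean_sq) ** 2 for v in sq_vals) / len(sq_vals))
    return {
        "sum_abs": sum(abs_vals),                   # sum of absolute residuals
        "mean_abs": sum(abs_vals) / len(abs_vals),  # average absolute residual
        "mean_sq": mean_sq,                         # mean squared residual
        "sd_sq": sd_sq,                             # SD of squared residuals
    }


if __name__ == "__main__":
    resid = [
        [0.0, 0.1, -0.2],
        [0.1, 0.0, 0.3],
        [-0.2, 0.3, 0.0],
    ]
    print(residual_indices(resid))
```

None of these summaries comes with a reference distribution, which is the drawback the text raises next.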

However, the chief drawback associated with residual
based procedures is the lack of a goodness of fit index.
The lack of an index leads to inconsistent interpretations
and results.

Recently,  procedures  with  a goodness  of  fit  statistic 
have  been  developed  for  use  with  the  polynomial  model  (De 
Champlain & Gessaroli, 1991; De Champlain, 1992). Both are
a variation  of  a Fisher  Z transformation  of  phi  residual 
correlations.  It  is  well  established  that  the  size  of  phi 
correlations  is  dependent  on  the  distribution  of  item 
difficulties  (Lord  & Novick,  1968).  Thus,  it  can  be  assumed 
both  statistics,  which  are  derived  from  a phi  correlation, 
will  probably  also  be  dependent  on  the  distribution  of  item 
difficulties.  Because  of  this,  both  procedures  have  been 
excluded  from  further  discussion. 

Procedures based on full-information factor analysis

As compared to factor analysis based on the GLS
solution, the full-information factor analysis, as first
termed by Bartholomew, is less restricted by number
of items (N = 60 to N = 100) and incorporates more
information (Mislevy, 1986). The marginal maximum
likelihood solution (MML), which was derived by Bock and
Aitkin (1981) as an extension to the unconditional maximum
likelihood solution developed by Bock and Lieberman (1970),
forms the basis of this procedure.

The  MML  procedure  is  based  on  the  assumption  that  the 
sample  has  been  drawn  from  a population  with  a multivariate 
normal distribution of latent traits, or $\theta \sim N(0, I)$. As a
result, it is assumed y, the unobservable "response process"
underlying the discretized manifest variable, is distributed

Muraki, 1988).

Based on the above assumptions, the marginal
probability of $x_j$, the binary 0 or 1 response pattern, is

$$P(x_j) = \int L_j(\theta)\, g(\theta)\, d\theta$$


and $g(\theta)$ usually is set as N(0, I), but can be set
empirically. The m-fold Gauss-Hermite quadrature provides

$$P_j \approx \sum_{q} L_j(X_q)\, A(X_q)$$


$A(X_q)$ is the corresponding weight (Bock & Aitkin, 1981;
Bock, Gibbons, & Muraki, 1988), which is a product formula
for numerical integration. To solve the likelihood for item
parameters, the EM algorithm is implemented.
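
To make the quadrature step concrete, the following sketch approximates the marginal probability of a response pattern for a single factor (m = 1) using a three-point Gauss-Hermite rule for the standard normal density (nodes 0 and plus or minus the square root of 3, weights 2/3 and 1/6). The slope and intercept parametrization of the normal ogive, the function names, and the choice of three points are assumptions made for illustration; operational programs use more points and the m-fold product rule.

```python
import math

# Three-point Gauss-Hermite rule for a N(0, 1) weight function.
NODES = (-math.sqrt(3.0), 0.0, math.sqrt(3.0))
WEIGHTS = (1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def pattern_likelihood(pattern, slopes, intercepts, theta):
    """Conditional probability of a 0/1 response pattern given theta."""
    like = 1.0
    for x, a, c in zip(pattern, slopes, intercepts):
        p = normal_cdf(a * theta + c)
        like *= p if x == 1 else 1.0 - p
    return like


def marginal_probability(pattern, slopes, intercepts):
    """Quadrature approximation: sum of L(X_q) * A(X_q) over the nodes."""
    return sum(
        w * pattern_likelihood(pattern, slopes, intercepts, t)
        for t, w in zip(NODES, WEIGHTS)
    )


if __name__ == "__main__":
    slopes = (1.0, 0.5)        # hypothetical item slopes
    intercepts = (0.0, -0.3)   # hypothetical item intercepts
    total = sum(
        marginal_probability((x1, x2), slopes, intercepts)
        for x1 in (0, 1) for x2 in (0, 1)
    )
    print(total)  # the pattern probabilities should sum to one
```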

The MML solution operates upon all higher order

denotes the number of items. It is based on the same model
as the generalized least squares solution, namely the
multivariate generalization of the two parameter normal item

As  noted  earlier,  one  of  the  primary  advantages  with 
the  approach  based  on  a GLS  solution  is  its  concomitant  test 
of  fit.  The  MML  solution  shares  this  advantage.  It,  too, 
has  an  associated  test  of  fit.  The  statistic  is  based  on 
the  likelihood  ratio  of  fit  relative  to  a multinomial 
alternative (Bock, Gibbons, & Muraki, 1988). The following
expression  shows  the  chi-square  approximation  for  the 
likelihood  ratio  test  of  fit: 


$$G^2 = 2\sum_{j} r_j \ln\!\left(\frac{r_j}{N\tilde{P}_j}\right)$$


the item parameters; $r_j$ represents the frequencies of the
response patterns, $x_j$, for n items for N examinees; and the
degrees of freedom are $2^n - n(m + 1) + m(m - 1)/2$ (Bock,
Gibbons, & Muraki, 1988, p. 265), where n is the number of
Bock, Gibbons, and Muraki (1988) suggested that to
determine model fit the chi-square should be performed each
time a factor is added, with the procedure ending once the
chi-square is no longer significant. The end model should

$G^2$ difference statistic examined in this study, which
follows a chi-square distribution.
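
The likelihood ratio statistic and its degrees of freedom can be sketched as below. The sketch assumes the statistic has the usual form, twice the sum over observed response patterns of the frequency times the log of the frequency divided by its expected count, and it reads the degrees of freedom as 2^n - n(m + 1) + m(m - 1)/2 for n items and m factors. Both readings are reconstructions from a damaged passage and should be checked against Bock, Gibbons, and Muraki (1988); the function names are hypothetical.

```python
import math


def g_squared(freqs, probs):
    """Likelihood ratio chi-square: 2 * sum r * ln(r / (N * P)).

    freqs: observed frequency of each response pattern.
    probs: model-implied marginal probability of each pattern.
    Patterns with zero frequency contribute nothing to the sum.
    """
    n_total = sum(freqs)
    return 2.0 * sum(
        r * math.log(r / (n_total * p))
        for r, p in zip(freqs, probs)
        if r > 0
    )


def fifa_df(n_items, m_factors):
    """Degrees of freedom, read here as 2^n - n(m + 1) + m(m - 1)/2."""
    return 2 ** n_items - n_items * (m_factors + 1) + m_factors * (m_factors - 1) // 2


if __name__ == "__main__":
    print(g_squared([50, 50], [0.4, 0.6]))
    print(fifa_df(5, 1), fifa_df(5, 2))
```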

Unlike GLS solutions, the MML can be corrected for
guessing. Bock, Gibbons, and Muraki (1988) provided a
correction for the effects of guessing, which is based on
the correction developed by Carroll (1945).

Nonparametric Procedures

Two procedures have been classified as nonparametric:
the HR and ST procedures. Both have been included for
investigation in the present study.


The  Holland  and  Rosenbaum  (HR)  procedure  (Holland  & 
Rosenbaum,  1986;  Rosenbaum,  1984)  has  been  recommended  as  a 
viable  nonparametric  alternative  (Zwick,  1987)  for  the 
assessment  of  unidimensionality.  The  procedure  is  based  on 


the assumption of conditional independence and monotonicity
(CISK). "The condition of latent conditional independence
states that $X_1, \ldots, X_J$ are conditionally independent given
U or

$$P(x \mid u) = \prod_{j} P_j(x_j \mid u)$$
covariance for each of the N(N - 1)/2 pairs of items, where N
denotes the number of items, a Mantel-Haenszel statistic
(Mantel & Haenszel, 1959) is used (Holland & Rosenbaum,
1986; Rosenbaum, 1984). The procedure is as follows:

1. Calculate the hypergeometric mean and variance

$$E(n_{11k} \mid u) = \frac{n_{1+k}\, n_{+1k}}{n_{++k}}$$

$$V(n_{11k} \mid u) = \frac{n_{1+k}\, n_{2+k}\, n_{+1k}\, n_{+2k}}{n_{++k}^{2}\,(n_{++k} - 1)}$$

items #1 and #2 correctly; $n_{1+k}$ denotes the number
who answered item #1 correctly; $n_{+1k}$ denotes

and test lengths of equal to or greater than SO (Nandakumar
& Stout, 1993; Nandakumar & Yu, 1994). Either statistic is
designed to test the null hypothesis of essential
unidimensionality, or $d_E$ = 1, versus the alternative
hypothesis of departures from essential unidimensionality,


The Stout (1987) procedure was devised for "small" test
administrations (J ≤ 2,000) with modifications for when J >
2,000. To implement this procedure, Stout, Nandakumar,
Junker, Chang, and Steidinger (1991) developed the software
package DIMTEST. Because sample sizes for this simulation
study will be limited to 500, 1,000, and 1,500, a
summary of Stout's (1987) procedure, as implemented by
DIMTEST for small test administrations, is presented in the
following paragraphs.

1.  Split  the  test  into  two  short  subtests  and  one  long 


compute the two variance estimates used to estimate the
Stout statistics. As will be seen later, the items of AT1
should be homogeneous: this can be achieved through one of


sthod! 


(b)  a 


subtest, and (b) the factor loading cutoff for determining
which items best measure the second factor. The combination
of both factors influences the size of the observed Type I
error rate (Nandakumar & Stout, 1993). With this in mind,

advantageous for the size of AT1 and AT2 to be determined
automatically for most applications. An algorithm was
designed that automatically fixes the size of AT1 and AT2 as
a function of the magnitude of second factor item loadings
and test length. Simulation work has indicated that the
maximum number of items in the subtests should be .25 of the
test length, and factor loadings should be at least .15
before using them to assign items (Nandakumar & Stout,
1993). For more details, see Nandakumar and Stout (1993).

The second set of M items for AT2 should be selected
from the remaining N-M items to "have an item difficulty
distribution as similar as possible to that of AT1" (Stout,
1987, p. 595). After selecting AT1 and AT2, the remaining n

place examinees into k subgroups.

2.  Place  examinees  into  k subgroups.  Before  assigning 
examinees  to  subgroups,  exclude  those  receiving  perfect  or 
zero  partitioning  test  (PT)  scores.  On  the  basis  of  PT 


represent  the 


AT1 score of examinee j from subgroup k, and


represent the mean AT1 score for subgroup k. Then the usual
variance estimate for group k is


where $U_{ijk}$ denotes the (0,1) response of examinee j to item i
for subgroup k, $J_k$ denotes total sample size, and M denotes
the length of the subtest AT1 (Stout, 1987).

The next set of expressions, taken from Stout (1987)
and Nandakumar (1991), shows how to compute the
unidimensional variance estimate. Let


represents  the  unidimensional  variance  estimate  for  subgroup 
k,  where  J denotes  the  number  of  examinees  from  subgroup  k, 
and  U denotes  the  response  of  the  jth  examinee  to  the  ith 
item  for  subgroup  k (Stout,  1987). 
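
Because the display equations for the two variance estimates did not survive in this copy, the following sketch shows one common way they are written, offered as an assumption rather than a transcription: the usual estimate is the variance of examinees' proportion-correct AT1 scores within a subgroup, and the unidimensional estimate is the sum of p(1 - p) over AT1 items divided by M squared, where p is the subgroup proportion correct on each item. The function name and data layout are hypothetical.

```python
def variance_estimates(responses):
    """Return (usual, unidimensional) variance estimates for one subgroup.

    responses: list of examinee rows, each a list of 0/1 scores on the
    M items of the assessment subtest (AT1).
    """
    j_k = len(responses)        # number of examinees in the subgroup
    m = len(responses[0])       # number of AT1 items
    scores = [sum(row) / m for row in responses]          # proportion-correct scores
    mean_score = sum(scores) / j_k
    usual = sum((y - mean_score) ** 2 for y in scores) / j_k
    p_hat = [sum(row[i] for row in responses) / j_k for i in range(m)]
    unidimensional = sum(p * (1.0 - p) for p in p_hat) / m ** 2
    return usual, unidimensional


if __name__ == "__main__":
    independent = [[1, 1], [0, 0], [1, 0], [0, 1]]   # items unrelated within subgroup
    dependent = [[1, 1], [0, 0]]                     # items move together
    print(variance_estimates(independent))
    print(variance_estimates(dependent))
```

When the items are unrelated within the subgroup, the two estimates coincide; when item responses move together, the usual estimate exceeds the unidimensional one, which is the difference the Stout statistic accumulates across subgroups.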

4.  Normalize  and  pool  variance  estimates  across 
subgroups to form $T_L$ for AT1. The aim of this step is to
produce  the  proper  standard  error  for  the  T index. 


represent  item  difficulty.  Then 


Originally, Stout (1987) proposed the following, which is
the standard error associated with $T_L$.


«*.»  = <1  - A'*’)  <1  ' 


si  - S-— 


if* 


where,  3 denotes  the  number  of  examinees  from  subgroup  k, 
and  denotes  the  original  standard  error  for  Tj  (Stout, 
1987,  p.  594). 

Nandakumar and Stout (1993) reported that the sampling
variance of $T_L$ resulted in an overly conservative Type I
error rate. As a result, Nandakumar and Stout (1993)
proposed a corrected estimated standard error, which has
been incorporated in DIMTEST, as T (Stout et al., 1991).
The following expressions lead to the corrected estimated

standard 


^ A'"  (1  - A'")  (1  - zisf')’ 


resulting  in  the  following  statistic 


from k subgroups of examinees (Nandakumar & Stout, 1993).

To  facilitate  the  comparison  of  the  performance  of  the 
procedure  before  and  after  modifications,  Nandakumar  and 
Stout  (1993)  designed  a study  to  emulate  an  earlier  study 
(Stout,  1987)  in  which  the  original  procedure  was 
investigated. (The Stout (1987) study is discussed in
greater detail in subsequent sections.) Both unidimensional
and two-dimensional data sets were generated with 100
replications per condition. For the most part, the modified
procedure  showed  improvement  across  all  conditions;  the 
results  indicated  that  the  observed  rejection  rate  of  the 
modified  procedure  was  closer  to  the  nominal  Type  I error 
rate  for  the  null  condition.  Not  only  did  the  modifications 


they  alt 


improvement  under  the  multidimensional  condition.  The 
procedures  after  adjustments  increased  in  power  across  all 
conditions. 

5. Compute the statistic $T_B$ for bias correction. As a
correction for bias generated by examinee variability and
item bias, Stout (1987) proposed the $T_B$ statistic. Bias
stemming from examinee variability becomes a problem with
short tests. A short test is less reliable than a longer
test. As a result, when a shorter test is administered, the
scores of examinees in fixed subgroups are more likely to
vary. Consequently, local independence is violated, which
results in an inflated statistic (Stout, 1987). This bias
is exacerbated if the subtest is composed of items of
similar difficulty.

To compute $T_B$, Nandakumar (1991) suggested repeating
steps 3 and 4 using AT2 items. This should result in a
correction for both types of bias.

6. Test the null hypothesis, $d_E$ = 1, versus the
alternative hypothesis, $d_E$ > 1. The general form for either
Stout statistic T is expressed by


(Nandakumar, 1991, p. 115)
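
The display equation for step 6 is missing from this copy. As a hedged illustration, the sketch below assumes the commonly reported form in which the bias-corrected statistic is the difference between the AT1-based and AT2-based statistics divided by the square root of 2, and it applies a one-sided test against the upper percentile of the standard normal distribution. The function names and the default significance level are assumptions introduced here.

```python
import math
from statistics import NormalDist


def stout_t(t_l, t_b):
    """Combine the AT1 statistic and the AT2 bias-correction statistic.

    Assumed form: (T_L - T_B) / sqrt(2).
    """
    return (t_l - t_b) / math.sqrt(2.0)


def reject_essential_unidimensionality(t, alpha=0.05):
    """One-sided decision rule: reject when T exceeds the upper 100(1 - alpha) percentile."""
    critical = NormalDist().inv_cdf(1.0 - alpha)
    return t > critical


if __name__ == "__main__":
    t = stout_t(3.0, 1.0)   # hypothetical values of the two statistics
    print(round(t, 4), reject_essential_unidimensionality(t))
```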

The decision rule, according to Stout (1987), is "reject $H_0$:
$d_E$ = 1 if T > $Z_\alpha$, where $Z_\alpha$ is the upper 100(1 - $\alpha$)
percentile for a standard normal distribution, $\alpha$ being the


The  perforae 


model  estimation, 
ingredients:  pro] 
efficient  estimat: 
assumptions.  All 


the  accuracy  i 
insure  this,  ne 


should  be 


calculate the RDI is theoretically superior to other methods
used to calculate other similar indices, Roznowski, Tucker,
and Humphreys (1991) have demonstrated through a monte carlo
study that the RDI performs inconsistently. They compared

differences (RDI) with a local independence index (LII) and
a pattern index (PI). The latter two indices were developed
by Tucker and Humphreys, respectively. The performance of
all three indices was affected by sample size, the size of
factor intercorrelations, and test length (Roznowski,
Tucker, & Humphreys, 1991). The RDI, however, was also
influenced by additional factors: the distribution of item
difficulties and the number of factors. In fact, they found
that if item difficulties vary widely, the use of the RDI
based on tetrachorics should be avoided unless sample size
exceeds 2,000. In line with this, Roznowski, Tucker, and
Humphreys (1991) concluded: "Previous research points to
one index that can be rejected without reservation ....
That index is the difference in size of eigenvalues of the
first two principal factors, and it is not recommended for
use" (p. 124).


In  all,  it  is  clear  thj 

stated  "overall,  linear  fact 
tetrachorics  and  using  elthe 


n fact.  He 
c analysis 


!S  based  on  ULS 
ittie  (19B4,  p.  73) 
: [based  on  phi  or 


not  appropriate  for 


ition] 


determining unidimensionality." As a result, these
procedures will receive no further mention and are not
included in the present study.

GLS  and  WLS  procedures 

Approaches  based  on  the  generalized  least  squares 
solution include the evaluation of the dispersion matrix and
the  use  of  a chi-square  statistic  of  fit  or  a chi-square 
difference  statistic.  Both  are  improvements  over  the 
procedures  associated  with  the  unweighted  least  squares 
solution,  but  are  not  without  shortcomings.  The  issues 
associated  with  the  use  of  fit  statistics  are  first 
addressed.  This  is  followed  by  a review  of  the  literature 
about  procedures  that  rely  on  the  use  of  residuals. 

As  compared  to  the  unweighted  least  squares  solution, 
the  generalized  least  squares  solution  has  two  major 
advantages.  Because  the  generalized  least  squares  solution 
incorporates  standard  errors,  an  asymptotic  chi-square  fit 
statistic  is  available  for  both  the  GLS  and  WLS  solutions. 
Hattie (1985), in his review, touted the fit statistics as
promising, but neither was included in his monte carlo
study (Hattie, 1984). Others have been less optimistic.
For instance, Christoffersson (1975, p. 8) warned "this chi-
square has to be used with caution," because the magnitude
of its value is a function of sample size, with larger
sample sizes resulting in a higher incidence of
significance. This cautionary note applies to the weighted
least squares solution (WLS) as well (Muthen, 1978; [...]). In
light of this, Mislevy (1986) recommended that chi-square
differences should be used to compare [...] Mislevy (1986, p. 15)
[...]

The  difference  between  the  chi-square  for  a m factor 
and a m + 1 factor solution for the same data also
follows  a chi-square  distribution  when  the  m-factor 
model  is  correct,  with  the  degrees  of  freedom  equal  to 
the  number  of  additional  parameters  estimated  in  the 
less  restrictive  solution. 
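
The comparison described in the passage quoted above can be sketched as a small helper. The function name and the example fit values are hypothetical; the helper only forms the difference statistic and its degrees of freedom, leaving the reference to a chi-square table or distribution function to the reader.

```python
def chi_square_difference(chi_m, df_m, chi_m_plus_1, df_m_plus_1):
    """Difference test between an m-factor and an (m + 1)-factor solution.

    Returns the difference in chi-square values and the difference in
    degrees of freedom, which equals the number of additional parameters
    estimated in the less restrictive (m + 1)-factor solution.
    """
    df_diff = df_m - df_m_plus_1
    if df_diff <= 0:
        raise ValueError("the (m + 1)-factor model must have fewer degrees of freedom")
    return chi_m - chi_m_plus_1, df_diff


if __name__ == "__main__":
    # Hypothetical fit results for one- and two-factor solutions.
    diff, df_diff = chi_square_difference(120.0, 35, 95.5, 26)
    print(diff, df_diff)
```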

The primary shortcoming of the WLS and GLS solutions for the
assessment of unidimensionality is that both are computationally
intensive. As a result, applications of the GLS solution
are limited to data sets with fewer than 25 items
(Christoffersson, 1975). The computer software originally
designed for the estimation of Muthen's (1979) WLS
solution, FADIV (Andersson, Christoffersson, & Muthen,
1974), and the more recently designed LISCOMP (Muthen,
1987), are both restricted to fewer than 40 items (Hattie,
1984; Nandakumar, 1994).

Such  a restriction  poses  a serious  limitation  for 
practical  applications  of  the  procedure.  In  the  literature 
a number  of  tests  are  cited  that  either  have  or  exceed  40 
items. For instance, the SAT verbal has 80 items (Lord,
1968); the Mathematics Usage Test from the Enhanced ACT
Assessment Program (EAAP) has 60 items (Ackerman, 1990); the
CTBS/U mathematics concepts and applications test, level S,

1984); and the English

the ACT Assessment Program has 75 items (Drasgow, 1987;
Woodruff, 1990).

The  chi-square  statistic  is  not  the  only  procedure 
based  on  the  generalized  least  squares  solution.  An 
alternative  is  the  examination  of  the  dispersion  matrix  of  a 
one factor model estimated using the GLS solution. Based on
the findings of a monte carlo study, Hattie (1984, 1985)
believed such an approach holds promise, because it
performed effectively and was theoretically sound. In a
monte carlo study, he examined the sum of absolute residuals
from the FADIV procedure, as well as other procedures.

In  his  study  he  examined  a number  of  indices,  which 
were  categorized  on  the  basis  of  (a)  answer  patterns,  (b) 
reliability,  (c)  component  or  factor  analysis,  and  (d) 
latent  trait  models.  The  completely  crossed  design  was 
comprised  of  36  conditions,  each  replicated  24  times:  (a) 

two sets of difficulty parameters [(-2, -1, 0, 1, 2) and (-1,
-.5, 0, .5, 1)]; (b) three sets of guessing parameters
[(0, 0, 0), (.2, .2, .2), and (.0, .1, .2)]; and (c) six
dimensionality conditions (i.e., 1-factor, discrimination
mixed; 1-factor, all discriminations equal; 2-factor,

5-factor, intercorrelations of .1; 5-factor,
intercorrelations of .5). Sample size was fixed at 500 and
test  length  was  fixed  at  15. 


iltidlmenslonal  d< 


testing situations. Although Hattie (1984) acknowledged
this limitation in his discussion of the findings, the
design of his study precluded it from being addressed
empirically, because he limited test lengths to 15 items.

In  sum,  although  procedures  based  on  the  GLS  solution 

they  still  cannot  be  recommended  for  the  assessment  of 

restricted  to  test  lengths  of  less  than  40  items. 
Furthermore,  the  procedure  based  on  the  evaluation  of  the 
residuals  is  unsatisfactory,  because  it  has  no  established 


may be worthwhile to investigate the performance of
the chi-square statistic.


ilogies  should  bring  improved 


the chi-square statistic associated with the full-
information factor analysis may be the better choice of fit
statistic, because it has no restrictions on test length.


accounted for only two to three percent of the variance. In
fact, she found that the chi-squares from the simulated data
were larger than those of the actual NAEP data. To explain
this, she stated "given that the application of a
unidimensional normal ogive model to the data yielded a poor
fit, the most likely explanation for the difference in size
of the test statistics is the dependence of chi-square
statistics on sample size in the non-null case" (pp. 304-
305).


on  the  evaluation  of  eigenvalues  associated  with  FIFA.  The 
chi-square  difference  statistic  associated  with  the  FIFA  and 
the  evaluation  of  the  eigenvalues  associated  with  the 
principal components resulted in an overestimation of the
number  of  factors.  Finally,  the  HR  procedure  exhibited  an 
overly  conservative  rejection  rate. 

Although Zwick's (1987) results concerning the FIFA
procedure  were  not  encouraging,  the  performance  of  the  FIFA 
procedure  under  the  nonnull  condition  is  still  an  open 
question,  particularly  under  nonnormal  or  missing  data 
conditions.  Furthermore,  her  study  included  no  replications 

Therefore, the generalizability of this study is somewhat
limited  to  those  fixed  conditions  and  the  two  particular 
data  sets  generated  for  the  study. 


>lvnonlal  model 


the 


unclear what criterion should be used to determine whether
the residual values are sufficiently low. Hambleton and
Rovinelli used the residuals from well fitting linear factor
analysis models as criterion, but, as they admitted, this
criterion is not adaptable to real data sets where it would

Furthermore, they pointed out that it is difficult to
determine whether terms exceeding the quadratic order should
be included in the model.

The results of the Hambleton and Rovinelli (1986)
study, however, should be interpreted with caution. That
is, the structure of the data sets generated by Hambleton
and Rovinelli did not approach the complexity typically
found  in  real  response  sets,  particularly  in  terms  of  item 
distribution  across  factors.  The  reason  for  this  is  that  a 
simplified  version  of  the  Christoffersson-Hattie  model 
(Christoffersson,  1975;  Hattie,  1981)  was  used  for  data 

only  one  dimension.  In  other  words,  no  item  was  associated 
with both dimensions. More typically, each item is
associated  with  more  than  one  factor,  even  if  that  one 
factor  is  the  dominant  factor  (Humphreys,  1965;  Stout, 


For instance, Reckase (1985) provided empirical
evidence that items are often associated with more than one
factor. He conducted two-dimensional analysis of the ACT

MAXLOG program (McKinley &
Reckase, 1983), and calculated the direction cosines for
each of the forty items. The direction cosine associated
with a particular item gives an indication to what extent
the item measures each dimension. For instance, if an item
has a direction cosine of 1.00, then it only measures the
first dimension; whereas if an item has a direction cosine
of 0 it measures only the second dimension. Therefore, for
any values between 0 and 1.00, the item measures both
dimensions. Reckase (1985) found that just a little more
than half of the items had direction cosines ranging between
.80 and .21, which indicates that a little over half of the
items were measures of two dimensions. "Mathematical 'story
problems'  are  the  most  common  example  of  this  type  of  item; 
both  mathematic  and  verbal  skills  are  required  to  obtain  a 
correct  answer"  (Reckase,  1985,  p.  401). 
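
As a sketch of how a direction cosine summarizes what an item measures, the helper below computes the cosines from a two-dimensional discrimination vector by dividing each discrimination by the vector's length. This is the standard geometric definition and is offered here as an assumption about the computation; the function names, the example discriminations, and the use of the .21 to .80 band as a screening rule are illustrative only.

```python
import math


def direction_cosines(a1, a2):
    """Direction cosines of an item's discrimination vector (a1, a2)."""
    length = math.hypot(a1, a2)
    if length == 0.0:
        raise ValueError("item has no discrimination on either dimension")
    return a1 / length, a2 / length


def measures_both(cos_first, low=0.21, high=0.80):
    """Flag an item whose first-dimension cosine falls inside the given band."""
    return low <= cos_first <= high


if __name__ == "__main__":
    for a1, a2 in [(1.2, 0.0), (0.0, 0.8), (1.0, 1.0)]:  # hypothetical items
        c1, c2 = direction_cosines(a1, a2)
        print(round(c1, 3), round(c2, 3), measures_both(c1))
```

A cosine of 1.00 on the first dimension corresponds to an item that loads only on that dimension, a cosine of 0 to an item that loads only on the second, and intermediate values to items that draw on both.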

associated  with  the  use  of  traditional  nonlinear  procedures. 

fit.  In  line  with  this,  Hattie  (1984,  p.  75)  suggested  that 
"further  research  is  needed  before  specific  cutoff  points 
can  be  recommended  below  which  it  can  be  concluded  that  a 
set  of  items  is  unidimensional. " Second  is  the  difficulty 
in  determining  the  number  of  polynomial  terms  to  include  in 
the  model.  In  a later  paper,  Nandakumar  (1994)  voiced 


she  found 


inflated Type I error rate for the null condition (Zwick,
1987). However, the generalizability of these findings is
limited to the two data sets included in the study; no
replications were completed. Although it is suspected the
power of the procedure will also be high, it is not known
whether this will be the case under various conditions, such
as when the assumption of normal latent traits is violated.

Nonparametric Procedures

Thus far, the focus of this review has been on the
empirical findings associated with parametric procedures.
Now the focus will shift to the nonparametric procedures.
The performance of two procedures will be examined: the
Stout T procedure (1987), which is based on the assumption of
essential unidimensionality; and the Holland-Rosenbaum
procedure (Holland & Rosenbaum, 1986; Rosenbaum, 1984),
which is based on the traditional assumption of
unidimensionality.

The  ST  procedure 

The  Stout  T procedure  has  been  fully  examined  under  a 

Champlain  & Tang,  1993;  Nandakumar,  1991,  1994;  Nandakumar  & 
Stout,  1993;  Stout,  1987).  Following  the  development  of  the 
Stout procedure, Stout (1987) conducted a Monte Carlo study
to  determine  whether  the  observed  Type  I error  rate  adhered 


(1987; 


generated  both  one-  and  two-dimensional  data  sets.  The 
correlations  between  the  two  dimensions  were  fixed  at  .5  and 
.7,  which  are  considered  to  be  moderate  to  high  for  two- 
dimensional  data  sets.  Items  were  distributed  across  the 
two dimensions in various ways: under some conditions all
items were assigned equally to both dimensions; whereas
under other conditions the majority of items were assigned
to  one  dimension,  and  the  remaining  minority  were  assigned 
to  both  dimensions.  The  three  sample  sizes  were  750,  2,000, 
or  20,000.  Reports  of  item  parameters  associated  with 
actual  tests  guided  the  selection  of  item  parameters  for 
both  unidimensional  and  multidimensional  3-parameter 
logistic  item  response  models,  but  they  were  not 
systematically  controlled. 

The results of Stout's (1987) study indicated that
under the strictly unidimensional condition, the rejection
rates were less than or equal to 3% at the 0.01 significance
level, less than or equal to 7% at the 0.05 significance
level, and less than or equal to 17% at the 0.10
significance level. Furthermore, at the 0.05 significance
level the rejection rates averaged 2.16% across all
conditions, which may suggest that the Stout T statistic is
somewhat conservative. The performance of the statistic is
not invariant across sample sizes. At the 0.05 significance
level, rejection rates averaged 1.4% for examinee size 750,
whereas rejection rates averaged 3% for


test  becomes  less  conservative. 

Moreover,  the  Stout  statistic  was  shown  to  exhibit 
moderate  to  poor  power  when  subjected  to  conditions 
associated  with  two  dimensional  data  sets.  Rejection  rates 
decreased with a decrease in sample sizes and with an

was  more  pronounced  when  the  lower  asymptote  (c)  was  0.2 
than when it was 0.0. That is, for sample sizes of 750 when
c = .2 at the .05 significance level, the rejection rate
decreased from 62% to 36% as the correlation increased from
.5 to .7. Increased sample size helped to offset this trend.
Although  rejection  rates  for  sample  sizes  of  2,000  at  the 
0.05  significance  level  also  dropped  as  the  correlation 
increased,  the  drop  in  rejection  rates  was  of  smaller 


The  Stout  statistic  exhibits  moderate  to  good  power 
when  sample  sizes  exceed  or  are  equal  to  2,000.  The 
distribution  of  item  loadings  across  dimensions  does  not 
seem  to  adversely  affect  power.  However,  the  presence  of 
guessing, c = 0.2, as well as the presence of highly
correlated  dimensions,  r = .7,  appears  to  reduce  rejection 


itlstic 


iminal  level 


problems for the generalizability of the Nandakumar (1994)
results  beyond  the  data  sets  generated  specifically  for  the 
study.  Without  replications,  it  is  uncertain  whether  the 
effects  are  the  result  of  the  explanatory  variables  or  the 
result  of  sampling  fluctuations. 

As  mentioned  earlier,  one  potential  shortcoming 
associated  with  the  Stout  T procedure  is  that  it  may  not  be 
robust  to  departures  from  normality,  though  the  findings  are 
mixed.  De  Champlain  and  Tang  (1993)  investigated  the 

a normal distribution with zero skewness, and negatively and
positively  skewed  distributions,  each  with  -1,75  and  .750 
skewness,  respectively.  For  data  sets  with  both  negative 
and positive skew, the rejection rates exceeded the nominal
level of .05. In fact, across all combinations of sample
size (J = 500 and 1,000) and test length (N = 20 and 40) the
statistic consistently performed poorly for the negatively
skewed data. The findings of De Champlain and Tang (1993)
suggested that the test appears to be liberal for skewed
distributions. De Champlain and Tang (1993) attributed the
poor performance of the statistic to the effects of skewed
data on the implementation of the principal axis factor


Nandakumar and Yu (1994) disagreed with De Champlain
and Tang's (1993) contention. They provided evidence that


Tj  statistic  was  affected  by  the 


iiditional 


bee: 


posited  as  a potential  source  of  conservatism, 
conditional  scores  and  latent  variable  have  a less  than 
perfect relationship (Nandakumar, 1994).

Only  a few  researchers  have  examined  the  performance  of 
the  HR  procedure.  Ben-Simon  and  Cohen  (1990)  examined  the 

combinations  of  sample  size  (J  = 1,000,  2,000,  3,000,  4,000) 
and  test  lengths  (N  = 20,  30,  40,  50). 

The  combination  of  large  sample  sizes  and  short  test 
lengths  for  unidimensional  data  resulted  in  larger  means  and 
standard  deviations  than  found  in  the  standard  normal 
distribution.  The  end  result  was  an  infrequent  rejection  of 
the null hypothesis (Ben-Simon & Cohen, 1990). Similarly,
sample  size  and  test  length  affected  the  sensitivity  of  the 
statistic  under  the  nonnull  condition.  But  the  direction  of 
the  effects  in  terms  of  sample  size  was  reversed.  In  this 
case,  the  combination  of  short  test  lengths  and  small  sample 

rejection  rates.  So,  when  multidimensionality  is  present, 
large  sample  sizes  and  long  tests  should  lead  to  increased 
rejection  rates.  However,  even  under  the  nonnull  condition 
the  test  remains  overly  conservative.  As  a result,  all 
multidimensional data sets were misclassified as
unidimensional. 

In  sum,  Ben-Simon  and  Cohen  (1990)  found  that  for 
unidimensional  data  sets  the  procedure  performed  better  for 


a combination of small sample sizes and long test lengths,
and for multidimensional data sets the procedure performed
better for a combination of large sample sizes and long test
lengths. It remains an open question whether this is indeed
the case, because with only two replications, it is
difficult to determine whether these trends should be
attributed to the factor combinations or whether they should
be attributed to sampling fluctuations.

Recall, Nandakumar (1994) examined the performance of
the Holland-Rosenbaum procedure under both null and nonnull
conditions for a sample size of 2,000, test lengths of 25
and 50, and inter-trait correlations of .3 and .7. In line
with Ben-Simon and Cohen's (1990) results, Nandakumar (1994)
found the Mantel-Haenszel statistic to be conservative for
the null condition. However, in contrast to Ben-Simon and
Cohen's findings, Nandakumar found the statistic readily
detected departures from unidimensionality for factor
combinations of large samples (J = 2,000) and short test
lengths (N = 25) or large samples (J = 2,000) and long test
lengths (N = 50) for a low inter-trait correlation of .3.

rform  differently  for 


the  null  condition.  The  has  an  inflated  Type  I error 
rate for a combination of small sample size (J < 1000) and
long  test  length  (N  > 40),  whereas  tends  to  have  a 


The  power  of  both  statistics  appears  to  be  affected  by 
a combination  of  test  length  and  inter-trait  correlation. 

As the test length decreases and the inter-trait correlation
increases, the power of either statistic decreases (Stout,
1987; Nandakumar & Stout, 1993). However, the effect of the
interaction of test length and inter-trait correlation is
more substantial for the more conservative statistic than
for the more powerful statistic (Nandakumar & Stout, 1993).

Furthermore,  it  appears  that  both  statistics  are  robust 
to  slight  to  moderate  violations  of  normality,  as  long  as  a 

level of the partitioning test (PT) (De Champlain & Tang,
1993; Nandakumar & Yu, 1994). But the findings suggested
that  the  behavior  of  the  Tj  statistic  is  less  consistent 
than  that  of  T^  when  smaller  numbers  of  examinees  are 
assigned  at  each  score  level.  It  still  remains  unclear  what 
the effect of more extreme skewness is on the performance of
the  two  statistics  for  either  the  null  or  nonnull  case. 

The Holland-Rosenbaum (Holland & Rosenbaum, 1986)
procedure  has  received  considerably  less  attention  than  the 


Stout T procedure. At this point, findings have indicated
that the HR procedure appears to exhibit an overly
conservative rejection rate for the null case for small
sample sizes, and it appears to exhibit reasonable
rejection rates for large samples and long test lengths,
especially when the inter-trait correlation is .3
(Ben-Simon & Cohen, 1990; Nandakumar, 1994). Because studies
conducted  to  date  concerning  the  HR  procedure  have  included 

whether the observed Type I error rate adheres to the
nominal  rate,  and  whether  the  procedure  displays  adequate 

Conclusions 

Since the Hattie (1984, 1985) papers, a number of

developed.  Many  of  these  are  both  theoretically  sound  and 
perform  adequately  for  either  the  null  or  nonnull  cases.  Of 
the  parametric  procedures  FIFA  was  the  only  one  selected  for 

of FIFA is that its associated chi-square difference has
been  shown  to  be  overly  sensitive  to  departures  in  model 

with  60  items  or  more.  Little  is  known  about  the 
performance  of  the  FIFA  for  multidimensional  data  sets  and 
nonnormal  data  sets. 

Both nonparametric procedures have been chosen for the
present study: the ST procedure and the HR procedure. The
Type I error rate of the Stout T procedure has been
established  for  various  sample  sizes  and  test  lengths.  It 
has  been  shown  the  Tp  may  be  inflated  for  long  tests  and 
small  samples,  and  that  both  statistics  are  unaffected  by 
low  to  moderate  skewness.  It  has  been  suggested,  however, 
that  both  statistics  perform  poorly  when  too  few  examinees 
have  been  assigned  to  a particular  score  level  of  the 
partitioning  test.  This  is  particularly  a danger  for  long 
tests  and  small  sample  sizes. 

The  power  of  the  Stout  T procedure  has  also  been 
examined  rather  extensively.  However,  the  findings  are,  for 
the most part, limited to data sets that have a strong form
of multidimensionality and are normally distributed. Under the
preceding  conditions,  the  procedure  exhibits  good  to 
excellent  power  for  various  test  lengths  (N  = 25,  50), 
sample  sizes  (J  = 750,  2,000),  and  inter-trait  correlations 
(r  = .3,  .5,  .7). 

Less  is  known  about  the  behavior  of  the  second 
nonparametric  procedure,  the  Holland-Rosenbaum  procedure. 

function.  In  fact,  for  the  null  condition  the  statistic 
performs  best  for  small  samples  and  long  tests  and  for  the 
nonnull  condition  it  performs  best  for  large 


samples


correlations. 


moderately  low  inter-t 


CHAPTER  3 
METHODS


unidimensionality have been shown to be promising: the
Stout T (ST) procedure (1987); the full-information factor
analytic (FIFA) procedure (Zwick, 1987); and the
Holland-Rosenbaum (HR) procedure (Holland & Rosenbaum, 1986;
Rosenbaum, 1984). Yet, no one approach is without problems.
It has been established through earlier studies that these
approaches are sensitive to departures from
unidimensionality. The degree of sensitivity, however, is
somewhat dependent on sample size, test length, and the
magnitude of the inter-trait correlation. Furthermore, it is
suspected that the power of these indices may also be
dependent on distributional characteristics (De Champlain &
Tang, 1993; McDonald, 1991; Nandakumar & Yu, 1994). The
purpose of this study is to examine the effects of sample
size, test length, distributional characteristics, magnitude
of inter-trait correlations, and their interactions on the
power of the ST


FIFA  procedi 


id  the 


The Design


Design  of  the  Study 


The power of each procedure was determined by the
percent of correct rejections of the null hypothesis of
unidimensionality for two-dimensional simulated data sets.
Rejection rates were examined for each procedure across four
factors: sample size (J), test length (N), inter-trait
correlation (r), and distribution type (DT). A description
of each factor is provided in the following discussion.
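
As a minimal sketch of how power is operationalized here, the percent of rejections across replications of one condition could be computed as follows. The function name, the example statistics, and the cutoff of 2.0 are hypothetical and are not taken from the study.

```python
def rejection_rate(statistics, critical_value):
    """Percent of replications whose statistic meets or exceeds the cutoff."""
    if not statistics:
        raise ValueError("need at least one replication")
    rejections = sum(1 for value in statistics if value >= critical_value)
    return 100.0 * rejections / len(statistics)


# Hypothetical statistics from five replications of a single condition.
example_statistics = [0.4, 1.9, 2.6, 1.7, 3.1]
power_percent = rejection_rate(example_statistics, critical_value=2.0)
```

Because every simulated data set is two-dimensional, each rejection counts as a correct rejection, so the percent rejected estimates power for that condition.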
Sample Size (J)

Sample sizes were 500, 1,000, and 1,500. The selection
was motivated by the need to reflect the size of typical
item response data sets as found in practice and in the
literature and to provide an adequate description of the
power curve. It was predicted that the power of the
selected procedures would be high for sample sizes at or
exceeding 2,000, resulting in exposure of only the upper end
of the power curve, which is relatively flat. To ensure
inclusion of the inflection point of the power curve, the
point at which the most change occurs, smaller sample sizes
were selected, resulting in a more informative description
of the

A sample  size  of  500  is  insufficient  for  the  estimation 
of  stable  item  parameters  for  the  three  parameter  logistic 
model  but  is  sufficient  for  the  one  or  two  parameter  models 
(Hulin,  Lissak,  & Drasgow,  1982).  The  remaining  two  sizes 


in  stable  estimates  for  one,  two,  or  three  parameter 

It has been shown that the power of all procedures of
interest is somewhat dependent on sample size. For
instance, the ST procedure (1987) shows improved power as
sample sizes increase from 750 to 2,000 (Nandakumar & Stout,
1993; Stout, 1987). Similarly, Ben-Simon and Cohen (1990)
found that the power of the HR procedure increases as sample
size increases; in this case, the sample sizes ranged from
1,000 to 4,000.


It  is  suspected  that  the  power  of  the  FIFA  chi-square 

(Mislevy, 1986). However, no empirical work has been done
to  show  the  effects  of  various  sample  sizes  on  the 
performance  of  the  FIFA  chi-square  procedure.  Zwick  (1987) 

of  the  performance  of  the  FIFA  chi-square  procedure.  No 
other  sample  sizes  were  examined. 

Test  Length 

Two  levels  of  test  length  will  be  examined,  25  and  50. 
The  selected  levels  reflect  test  lengths  that  are  both 


typical of real test data (Ackerman, 1988; Nandakumar &
Stout, 1993) and should result in stable IRT ability
estimates (Hulin, Lissak, & Drasgow, 1982). Findings of
prior  studies  have  indicated  that  the  sensitivity  of  the 
indices  is  affected  by  test  length.  It  is  well  known  that 


the  ST 


rly  fc 


or test lengths of 25
items or less (Nandakumar, 1987; Stout, 1987). Nandakumar
(1994) reported the power of the HR procedure drops with an

inequality (Nandakumar, 1991; Rosenbaum, 1984). In
contrast, Ben-Simon and Cohen (1990) found that as the
number of items increased, the power of the HR procedure
increased. According to Muraki and Engelhard (1985),
increasing the number of items has little effect on the
performance of FIFA. In fact, Bock, Gibbons, and Muraki
(1988) have suggested that FIFA works well with as many as
60 to 100


unidimensionality should be investigated for nonnormal
distributions (De Champlain & Tang, 1993; Hambleton &
Rovinelli, 1986; McDonald, 1991; Nandakumar, 1991, 1994;
Stout, 1987).


The 



readily be seen. First of all, two out of the three
procedures may be sensitive to violations of the assumption
of a normal latent ability distribution. Although both
statistics, T1 and T2, associated with the ST procedure are
nonparametric, the procedure incorporates a factor analysis
of tetrachorics (Stout, 1987), which assumes normality (Lord
& Novick, 1968). In fact, the findings from the De
Champlain and Tang (1993) study showed that the Stout
procedure under the unidimensional nonnormal condition
resulted in inflated Type I error rates.

The other procedure that may be sensitive to the shape of
the distribution is the FIFA procedure. The FIFA model is

distribution, which results in a parametric statistic, G²
(Mislevy, 1986). However, no studies have been completed
investigating the effects of nonnormality on the performance
of the FIFA chi-square. Only the HR procedure appears to be
free of this distributional assumption. However, to date no
studies have been completed to investigate this assertion.
Inter-trait  Correlation 


for this is, of course, that the response set approaches
unidimensionality with each increase in the size of the
inter-trait correlation (Reckase, 1979). For example,
Seraphine and Miller (1994) found that under the nonnull
condition the number of rejections for the Stout T procedure
decreased as the size of the inter-trait correlation
increased. In the same vein, Nandakumar (1994) found the
power of the HR procedure dropped as the inter-trait
correlation increased from .3 to .7. To date no work has
examined the effects of inter-trait correlation size on the
power of the FIFA procedure.

Design Layout

In sum, the power of the ST procedure, HR procedure,
and the FIFA procedure was examined for all combinations of
test length (N), sample size (J), inter-trait correlation
(r), and distribution (DT), resulting in a 2 X 3 X 2 X 2
design or 24 conditions for each procedure. Each of the 24
conditions was replicated 100 times.
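
The fully crossed layout can be sketched by enumerating the factor levels named in this chapter. The variable names below are illustrative only; the levels (N = 25, 50; J = 500, 1,000, 1,500; r = .3, .7; normal and lognormal distributions) and the 100 replications come from the text.

```python
from itertools import product

TEST_LENGTHS = (25, 50)             # N
SAMPLE_SIZES = (500, 1000, 1500)    # J
CORRELATIONS = (0.3, 0.7)           # r
DISTRIBUTIONS = ("normal", "lognormal")  # DT
REPLICATIONS = 100

# Every combination of the four factors is one design cell.
conditions = list(
    product(TEST_LENGTHS, SAMPLE_SIZES, CORRELATIONS, DISTRIBUTIONS)
)
total_data_sets = len(conditions) * REPLICATIONS
```

Crossing the factors this way yields the 24 cells of the design, each analyzed by all three procedures.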

Simulation Procedure

The  Model 

Each two-dimensional data set was generated using the
compensatory multidimensional two parameter logistic (M2PL)
model. The M2PL model as reported by Reckase (1985) is
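$$P(x_{ij} = 1 \mid \mathbf{a}_i, d_i, \boldsymbol{\theta}_j) = \frac{\exp[1.7(\mathbf{a}_i'\boldsymbol{\theta}_j + d_i)]}{1 + \exp[1.7(\mathbf{a}_i'\boldsymbol{\theta}_j + d_i)]}$$

where $x_{ij}$ represents the response (0, 1) on item i by person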
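
A minimal sketch of the compensatory M2PL response probability in its usual form (Reckase, 1985), with the 1.7 scaling constant, is given below. The function and argument names are illustrative assumptions and do not reproduce the SAS code used in the study.

```python
import math


def m2pl_probability(a, d, theta, scale=1.7):
    """Probability of a correct response under the compensatory M2PL model.

    a     : discrimination parameters, one per dimension
    d     : scalar intercept (difficulty-related) parameter
    theta : latent trait values, one per dimension
    """
    if len(a) != len(theta):
        raise ValueError("a and theta must have the same length")
    # Compensatory: a high trait on one dimension can offset a low one.
    logit = scale * (sum(ak * tk for ak, tk in zip(a, theta)) + d)
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical item measuring two traits, hypothetical examinee at the origin.
p_origin = m2pl_probability(a=(1.0, 0.5), d=0.0, theta=(0.0, 0.0))
```

A simulated 0/1 response would then be drawn by comparing this probability with a uniform random number.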


Both  item  parameters  and  item  direction  were  simulated 
to  provide  values  in  line  with  the  typical  values  estimated 

parameter selection process followed that of Oshima and
Miller (1990): MID parameters were randomly selected from a
standard normal distribution, ranging from -2.0 to 2.0, by
using the function NORMAL of SAS IML. The selection process
resulted in values bounded by -1.97 and 1.68. This range
is in line with values from published tests: (a) Mislevy

(1984) b values of the Armed Services Vocational Aptitude

(Mislevy & Bock, 1984); (b) the b values from the SAT-verbal


2.50  ( 


(c)  e 


1 the  H 


3 to 1.87 (Ackerman, 1988). The MDISC parameters were
chosen with actual test data in mind. These
were randomly selected from a lognormal distribution
with a mean of 1.13 and a standard deviation of .60. Most
were within the interval of .5 and 2.5; the lowest

highest value was 2.58. The resulting
values adhered closely to the values reported by Ackerman
(1988) for the ACT Assessment Mathematics Usage Test (.58 to
2.39). As mentioned earlier, item direction was
systematically fixed at one of five values, 0, 15, 30, 45,
systematically  fixed  at  one  of  five  values,  0,  15,  30,  45, 


The nonnormal


where σ* denotes the standard deviation of the normal
distribution; whereas μ and σ denote the mean and standard
deviation of the lognormal distribution, respectively. For
this study μ* = 0 and σ* = .5, and μ = 1.13, σ = .60.
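
The reported lognormal mean and standard deviation can be checked against the normal-scale values (mean 0, standard deviation .5) with the standard lognormal moment formulas. This is only a verification sketch; the function name is hypothetical.

```python
import math


def lognormal_moments(mu_star, sigma_star):
    """Mean and SD of exp(W) when W is normal(mu_star, sigma_star)."""
    variance_star = sigma_star ** 2
    mean = math.exp(mu_star + variance_star / 2.0)
    sd = mean * math.sqrt(math.exp(variance_star) - 1.0)
    return mean, sd


mean, sd = lognormal_moments(mu_star=0.0, sigma_star=0.5)
```

Rounded to two decimals, the results agree with the values of 1.13 and .60 given in the text.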

of latent traits. Mood, Graybill, and Boes (1974) showed
that "if W is multivariate normal then Y' = [exp(W1) . . .
exp(Wj)] is multivariate lognormal" (Algina & Oshima, in


10). 


iidakumsr  (19931 


The implementation of each procedure is discussed more
fully in the following paragraphs.

The  ST  procedure 

The Stout T procedure was implemented using DIMTEST,


is the generation of subtests. So to implement the ST
procedure, DIMTEST partitions the N items of the test into
three subtests: assessment test 1 (AT1) of length M,
assessment test 2 (AT2) of length M, and the
partitioning test (PT) of length (N - 2M) (Nandakumar &
Stout, 1993). The size of AT1 and AT2 was determined by
DIMTEST using an algorithm developed by Nandakumar and Stout
(1993). It was found the optimal size of the assessment
subtests should be M = N/4 with the cutoff factor loading for
item assignment starting at .15.

Recall, the assignment of items to AT1 is determined
through a principal axis factor analysis of tetrachorics.
The items with the highest loadings of the same sign on the
second factor are assigned to AT1. The implementation of
DIMTEST requires the researcher to specify the number of
examinees to be used for the factor analysis; the remaining

statistic.  Nandakumar  (1994)  reported  that  although  a 
sample  size  of  500  is  optimal  for  the  factor  analysis,  a 
sample  size  of  at  least  250  is  necessary  for  the  analysis. 
Therefore, for an examinee population size of J = 500, 250
were randomly selected for the factor analysis, leaving 250
for the computation of the statistic. For examinee
population sizes of J = 1,000 and 1,500, 500 examinees were
randomly selected for the factor analysis, leaving 500 and
1,000, respectively, for the computation of the statistic.
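
The split of examinees between the factor analysis and the computation of the statistic can be sketched as follows. The rule `min(500, J // 2)` is an assumption chosen only because it reproduces the three splits reported above; it is not DIMTEST's own logic, and the function name is hypothetical.

```python
import random


def split_examinees(n_examinees, fa_target=500, seed=None):
    """Randomly assign examinee indices to factor analysis or statistic."""
    # Assumed rule: use 500 for the factor analysis when at least as many
    # examinees remain for the statistic; otherwise use half the sample.
    fa_size = min(fa_target, n_examinees // 2)
    rng = random.Random(seed)
    ids = list(range(n_examinees))
    rng.shuffle(ids)
    return ids[:fa_size], ids[fa_size:]


split_sizes = {
    j: tuple(len(part) for part in split_examinees(j, seed=1))
    for j in (500, 1000, 1500)
}
```

The two groups are disjoint, so the examinees used to select AT1 items are not reused when the statistic is computed.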

As reported in Nandakumar and Stout (1993), the Stout T
procedure provides two values of the statistic: T1
represents the more conservative statistic, originally
developed by Stout (1987), and T2 represents the more
powerful statistic, a modification of the original
statistic. For either statistic, the null hypothesis of
dE = 1 was rejected if T ≥ Zα, where Zα is the upper
100(1 - α) percentile of the standard normal distribution,
α represents the Type I error rate, and T represents either
statistic.

The  critical  values  were  x 1.96  and  Z„  x 2.33. 

The Holland-Rosenbaum Procedure

The Holland-Rosenbaum procedure was implemented using
software developed by Nandakumar (1993). The aim of the
procedure is to test the conditional dependence of each of
the N(N - 1)/2 unique pairs of items, where N denotes the
number of items. Each pair of item responses is conditioned
on the total of observed item scores excluding the pair of
items of interest. To test the null hypothesis of
conditional independence and monotonicity for each
individual item pair, the Mantel-Haenszel statistic is used.
The null hypothesis


lower  lODc 


set  at  .01  or  .05. 

To determine whether the item responses for N items
exhibited conditional independence, a simultaneous test
based on the expected number of rejections when the per
comparison error rate is controlled was used. The null
hypothesis of conditional independence was rejected if the
number of item pair rejections exceeded qα, where q
signifies the number of unique item pairs and α is set at
either .05 or .01. The

The Full-Information Factor Analysis Procedure

The procedure based on full-information factor analysis
(FIFA) was implemented using TESTFACT (Wilson, Wood, &
Gibbons, 1991). The normal item response model is estimated

(MML) estimation procedure as outlined by Bock and Aitkin
(1981) on all 2^N possible responses. Because the EM
solution associated with MML estimation converges so slowly,
reasonable start values are required. The TESTFACT default
strategy for generating start values is through the
application of a principal axis factor analysis on
tetrachoric correlations. This approach worked well for the
normal data sets. However, because the assumption of latent
trait normality is fundamental to the use of tetrachorics,
this approach failed when applied to the majority of the


Table 1

The Cutoff Values (αq) for the Holland-Rosenbaum Procedure
Based on the Bonferroni Inequality for Test
Length (N) and Significance Level (α)

N      α       q      αq
25    .05     300     15
25    .01     300      3
50    .05    1225     61
50    .01    1225     12

Note. q denotes the number of unique pairs of items.
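
The tabled cutoffs can be reproduced from the number of unique item pairs, q = N(N - 1)/2, multiplied by the per comparison significance level. In this sketch the truncation to a whole number is an assumption made to match the tabled values (for example, 1,225 × .05 = 61.25 appears as 61); the function names are hypothetical.

```python
import math


def unique_pairs(n_items):
    """Number of unique item pairs, N(N - 1)/2."""
    return n_items * (n_items - 1) // 2


def pair_rejection_cutoff(n_items, alpha):
    """Cutoff q * alpha, truncated to a whole number of item pairs."""
    # The small constant guards against floating-point error in q * alpha.
    return math.floor(unique_pairs(n_items) * alpha + 1e-9)


def reject_unidimensionality(n_rejected_pairs, n_items, alpha):
    """Overall rejection when pair rejections exceed the cutoff."""
    return n_rejected_pairs > pair_rejection_cutoff(n_items, alpha)


cutoffs = {
    (n, alpha): pair_rejection_cutoff(n, alpha)
    for n in (25, 50)
    for alpha in (0.05, 0.01)
}
```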


lit,  for  the 


cnal 


lognormal  data 
sets, the facto


successful  full- 


information factor analysis of a randomly selected lognormal
data set were used as starting values for the analysis of the
remaining lognormal data sets. To determine whether the

square  values  were  computed  for  a single  data  set  using 
various  start  values:  the  observed  chi-square  remained  the 


Recall, one advantage of FIFA is that MML estimation
has  a corresponding  chi-square  statistic.  However,  it  has 

chi-square  difference  statistic  to  offset  the  chi-square's 
sensitivity  to  negligible  departures  from  fit  (Mislevy, 
1986).  Therefore,  both  one-  and  two-factor  models  were 
estimated  and  from  this  the  observed  chi-square  statistic 
difference  was  calculated.  The  null  hypothesis  that  the 
more parsimonious model (1-factor model versus 2-factor
model) provided the best fit was rejected for test
lengths of 25 if χ²(α = .05, df = 24) ≥ 36.14 and χ²(α =
.01, df = 24) ≥ 42.98 and for test lengths of 50 if χ²(α =
.05, df = 49) ≥ 67.50 and χ²(α = .01, df = 49) ≥ 76.15.
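
The chi-square difference decision rule can be sketched as below. The critical values are copied from the text rather than computed; the general expression df = N - 1 is an assumption that merely matches the two degrees of freedom reported (24 and 49), and the fit statistics in the example are hypothetical.

```python
# Critical values as reported in the text, keyed by (test length, alpha).
CRITICAL_VALUES = {
    (25, 0.05): 36.14,
    (25, 0.01): 42.98,
    (50, 0.05): 67.50,
    (50, 0.01): 76.15,
}


def difference_df(n_items):
    """Degrees of freedom for the 1- versus 2-factor comparison (assumed N - 1)."""
    return n_items - 1


def reject_one_factor(g2_one_factor, g2_two_factor, n_items, alpha=0.05):
    """Reject the 1-factor model when the chi-square difference is large."""
    difference = g2_one_factor - g2_two_factor
    return difference >= CRITICAL_VALUES[(n_items, alpha)]


# Hypothetical fit statistics for one 25-item data set.
decision = reject_one_factor(500.0, 363.5, n_items=25)
```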


Three  indices  are  considered  to  be  promising  tools  for 


(Stoi 


lidimensionality:  (a)  the  Stout 

<7,  1990);  (b)  the  chi-square  ass 


5-dtniensior 


CHAPTER  4 

RESULTS  AND  DISCUSSION 

The power of the Stout T (ST), Holland-Rosenbaum (HR),
and full-information factor analytic (FIFA) procedures was
investigated for combinations of distribution types (DT),
sample size (J), test length (N), and inter-trait
correlation (r). The results of the analyses are organized
into two sections, each corresponding to either normal or
lognormal distribution types. Within each section, the
results for various combinations of sample size, test


It was determined that the pattern of rejection rates
for α = .05 and α = .01 is similar for all four statistics.
The only difference is that the rejection rates for α = .01
are slightly lower than for α = .05. Because of this, only
the analyses of rejection rates for α = .05 are presented.
The results for α = .01 are reported in the Appendix.

Results  for  Normal 


Of  the  procedures  for  all  factor  combinations.  Examination 
of  rejection  rates  suggested  the  presence  of  a third-order 


Table  2 



for Combinations of Sample Size (J), Test Length (N), and


24 


42 


75 


99 


43 


13 



Percent  Rejection  R 


the .05 Significance Level for



Table 5

Size (J), Test Length (N), Inter-trait Correlations, and for
the Normal Distribution (DT, N=Normal)

              N=25                 N=50
           Correlation          Correlation
J          .3       .7          .3       .7
500       .405    -.053       2.543     .423
1000     1.314    -.214       3.403     .260
1500     1.734    -.228       4.703     .660



Table  6 

Average Observed Values for for Combinations of


1000

1500 


Correlation 


.305 


1.420 


3.029 

4.118 


Table 


Number (Percent) of Item Pairs Rejected at the .05
Significance Level for the Holland-Rosenbaum Procedure for
Combinations of Sample Size (J), Test Length (N),
Inter-trait Correlations, and for the Normal Distribution (DT,


500  5.22 

(1.7)

1000  6.45 


(2.6) 


(5.2) 

95.39 


Note. For N=25 there were 300 unique item pairs; for N=50
there were 1,225 unique item pairs. Percents rounded to the
nearest 1/10.


The uniform effect


The percent rejection rates of the G² were uniformly
high for the normal data sets. Table 8 shows the percent
rejection rates of G². The power dropped only slightly for
N = 25 and J = 1,000 and may have been due to sampling
fluctuations.

Table 9 shows the average observed values of the G² for
each combination of factors. The values of the statistic
vary as a function of individual effects. That is, as the
sample size and test length increased and the magnitude of
the inter-trait correlation decreased, the magnitude of the
observed G² increased. But this has little impact on the
power of the statistic. All mean values exceeded the
critical values χ²(24) = 36.14 and χ²(49) = 67.50, which
resulted in the uniform effect reported in Table 8. Thus,
it can be seen the pattern of the average observed values is
in line with the rejection rates for all factor
combinations.

Results  for  Lognormal 

Visual inspection of rejection rates suggested the
presence of a uniform effect for T1, T2, and MHz and a
third-order effect (J X N X r) for G². Moreover, the
findings  indicate  that  the  G*  statistic  is  the  only  one  of 
the  three  that  exhibited  adequate  power  for  lognormal  data. 
The  results  for  the  lognormal  distribution  type  are  reported 
in  the  following  paragraphs. 


The  uniform  effect 



Tables  10,  11,  and  12  report  the  percent  rejection 
rates for the T1, T2, and MHz, respectively. All three
statistics  exhibited  uniformly  low  power.  Of  the  three,  the 
Tp  had  slightly  more  power  with  percent  rejection  rates 
ranging  from  0 to  5.  The  percent  rejection  rates  of  Tp 
ranged from 0 to 3, whereas the rejection rates of MHz were
uniformly  zero.  It  is  likely  these  differences  can  be 
attributed  to  sampling  fluctuations. 

The average observed values of T1 and T2 are uniformly
low, as Tables 13 and 14 indicate. Similar patterns emerge
for both observed values and percent rejection rates for
both statistics. The average percent of item pairs for the
HR procedure rejected at α = .05 for all factor combinations
is shown in Table 15. As can be seen, in general the MHz
statistic  resulted  in  very  few  item  pair  rejections,  which 
supports  the  findings  reported  in  Table  12. 

The effect of J, N, and r

Of the four statistics, the G², alone, demonstrated
adequate power for the lognormal distribution. Table 16
shows the percent rejection rates for the statistic under
the lognormal condition. The percent rejection rates
indicate the presence of a third-order interaction of J, N,
and r. The following patterns emerge for G²: (a) the
statistic exhibited good to excellent power with rejection
rates ranging from 86 to 98% for N = 50; (b) except


Table 8

... Statistic at the .05 ... Sample Size (J), Test ...


Table 9

Average Observed Values for the FIFA Statistic for
Combinations of Sample Size (J), Test Length (N),
Inter-trait Correlations, and for the Normal Distribution
(DT, N=Normal)

               N=25                N=50
            Correlation         Correlation
    J       .3       .7         .3       .7

   500    136.5     86.3      466.6    300.9
  1000    248.7    146.0      753.1    546.4
  1500    358.9    206.5     1337.0    799.0

Note. If N=25 then df=24; if N=50 then df=49.


Percent Rejection Rates for ... at the .05 Significance
Level ... Lognormal Distribution (DT, ...


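Percent Rejection Rates for ... at the .05 Significance Level
for Combinations of Sample Size (J), Test Length (N),
Inter-trait Correlations, and for the Lognormal Distribution
(DT, L=Lognormal)

               N=25                N=50
            Correlation         Correlation
    J       .3       .7         .3       .7

   500       0        1          2        2
  1000       1        2          0        2
  1500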


Percent Rejection Rates for Holland-Rosenbaum Procedure at
the .05 Significance Level for Combinations of ... (J),
Test ... and for the ...


Table 13

... Size (J), Test Length (N), Inter-trait Correlations, and
for the Lognormal Distribution (DT, L=Lognormal)

               N=25                N=50
            Correlation         Correlation
    J       .3       .7         .3       .7

   500    -.094    -.125      -.393    -.330
  1000    -.077    -.079      -.265    -.213
  1500    -.141    -.161      -.540    -.362


Average Observed Values for ... for Combinations of Sample
Size (J), Test Length (N), Inter-trait Correlations, and for
the Lognormal Distribution (DT, L=Lognormal)


Note. ... there were 1,225 unique item pairs. Percents
rounded to the nearest 1/10.


Percent Rejection Rates for the FIFA Statistic at the .05
Significance Level for Combinations of Sample Size (J),
Test ...

               N=25
            Correlation
    J       .3       .7

   500      47       42
  1000      66       55
  1500      63       43

for combinations of N = 25 and J = 1,500, the rejection rates
increased slightly with increases in J and decreased or
remained the same with increases in r; and (c) contrary to
expectations, the percent rejection rate dropped slightly for
N = 25 when J increased from 1,000 to 1,500 irrespective of
r. Table 17 shows the average observed values of G² for the
lognormal distribution. The pattern of observed values is
in concord with the pattern of rejection rates.

In all, the rejection rates of the two Stout T statistics,
MHz, and G² were lower for the lognormal condition than for
the normal condition. Contrary to expectations, the
nonparametric statistics (the two Stout T statistics and MHz)
exhibited virtually a complete

parametric statistic (G²), which should be somewhat
sensitive to violations of the assumption of normality,
demonstrated good to excellent power for a test length of 50


Table 17

Average Observed Values for the FIFA Statistic for
Combinations of Sample Size (J), Test Length (N),
Inter-trait Correlations, and for the Lognormal
Distribution (DT, L=Lognormal)

               N=25                N=50
            Correlation         Correlation
    J       .3       .7         .3       .7

   500     27.4     35.2       76.1     41.3
  1000     40.9     31.7       94.1     81.5
  1500     40.2     35.7      111.7     89.4

Note. If N=25 then df=24; if N=50 then df=49.



uniformly high power across all conditions under normality.
Secondly, the Holland-Rosenbaum and the Stout T procedure
displayed good to excellent power under selected factor
combinations for normal data: (a) the Stout T procedure had
good to excellent power for large sample sizes (J ≥ 1,000)
with long tests (N = 50) and low inter-trait correlations
(r = .3); and (b) the Holland-Rosenbaum had excellent power
for the combined effects of a large sample size (J = 1,500),
long test (N = 50), and low inter-trait correlation
(r = .3). Clearly, the Stout T procedure performed well
under a wider range of conditions than did the
Holland-Rosenbaum procedure.


CHAPTER  5 
CONCLUSIONS 



distribution: (a) the Holland-Rosenbaum procedure and the
Stout T procedure both exhibited uniformly low power across
all combinations of sample size and test length (Conclusion
1); and (b) the FIFA procedure outperformed the
Holland-Rosenbaum procedure and the Stout T procedure for the
lognormal distribution (Conclusion 2). Both trends are
somewhat surprising given that the nonparametric procedures
appear to be more affected by distribution type than the
lone parametric procedure. However, given the design of
the present study, explanations of both conclusions are at




partitioning  items  into  subtests.  The  second  condition  is 
that  there  should  be  a sufficient  number  of  examinees  at 
each  ability  level  to  provide  stable  estimates  of  the  usual 
variance  and  unidimensional  variance  at  each  score  level. 
The  third  condition  is  that  there  should  be  sufficient 

values. A reduction in test score variance is likely to


the lognormal data. Because the lognormal distribution is
extremely skewed, the factor analysis of the tetrachorics
may have resulted in an increased incidence of difficulty
factors (Gourlay, 1951). However, it is unlikely that the
presence of difficulty factors leads to reduced power; in
fact, difficulty factors are more likely to lead to
increased power for which the Stout T procedure provides a


by  the  extreme  skewness  of  the  data,  because  extreme 
skewness  can  lead  to  either  floor  or  ceiling  effects.  In 
the  present  study,  because  lognormal  data  are  positively 


probably 


very few examinees were assigned to ability groups at the
higher score levels. Yet, it is unlikely that instability of
the statistic would lead to such consistently low rejection


However, the presence of floor effects may influence
the power of the procedure in yet another way. One
byproduct of floor effects is a restriction of test score
variability, which is a violation of condition three. As a
result, it appears values for the usual variance estimate
and unidimensional variance estimate are reduced. Of the
two variance estimates, it is likely the usual variance
estimate is more sensitive to floor effects than the
unidimensional variance estimate. Recall, the unidimensional
variance estimate is comprised of summed item variances;
whereas, the usual variance estimate is comprised of item
covariances in addition to summed item variances. Floor
effects affect both variances and covariances; in fact, such
effects can result in negative covariances. Therefore, not
only is it possible for the usual variance estimate to be
close to the unidimensional variance estimate in value, it
may also be less than the unidimensional value. The
existence of negative item covariances may provide an
explanation for the negative values of the two Stout T
statistics shown in Tables 6 and 5.
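
The relation between the two variance estimates can be
illustrated with a small sketch. It is not the Stout T
computation itself; it only demonstrates the identity that the
variance of a total score equals the summed item variances plus
twice the summed item covariances. The function names and the
tiny response matrix are hypothetical, and items are assumed to
be scored 0/1.

```python
from itertools import combinations

def mean(xs):
    return sum(xs) / len(xs)

def cov(x, y):
    """Population covariance; cov(x, x) is the variance of x."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def variance_components(responses):
    """responses: one row of 0/1 item scores per examinee."""
    items = list(zip(*responses))           # one column per item
    totals = [sum(row) for row in responses]
    usual = cov(totals, totals)             # variance of total scores
    summed_item_var = sum(cov(it, it) for it in items)
    summed_cov = sum(cov(a, b) for a, b in combinations(items, 2))
    return usual, summed_item_var, summed_cov

# Hypothetical data: two items that are negatively related.
responses = [[1, 0], [0, 1], [1, 0], [0, 1]]
usual, unidim, covs = variance_components(responses)
```

With negatively covarying items, `usual` falls below `unidim`,
which is the situation the paragraph above offers as a possible
source of negative T values.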

The effective performance of the Holland-Rosenbaum
procedure may also require sufficient item variability and a
sufficient number of examinees at each score level. Recall,
before  testing  the  dependency  of  item  pairs,  examinees  are 
subdivided  into  homogeneous  ability  groups  on  the  basis  of 
their  total  scores  excluding  the  scores  of  the  pair  of  items 
being  tested.  In  the  presence  of  extreme  skewness  it  is 
likely  that  the  higher  score  level  groups  will  either  be 
empty  or  have  too  few  examinees,  and  this  may  adversely 
affect  the  power  of  the  MHz  statistic. 
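
The grouping step described above can be sketched as follows.
This is a minimal illustration, not the Holland-Rosenbaum
procedure: it only forms the rest-score groups (total score
minus the two items under test) and counts their sizes. The
function name and the small floor-effect data set are
hypothetical. The item-pair counts use the fact that a test of
N items has N(N - 1)/2 unique pairs.

```python
from collections import Counter
from math import comb

def rest_score_groups(responses, i, j):
    """Count examinees at each rest score, excluding items i and j."""
    return Counter(sum(row) - row[i] - row[j] for row in responses)

# Unique item pairs for the two test lengths in the study.
n_pairs_25 = comb(25, 2)
n_pairs_50 = comb(50, 2)

# Hypothetical floor-effect data: most examinees score very low.
responses = [
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0],
]
groups = rest_score_groups(responses, 0, 1)
```

In this toy example the lowest rest-score group holds most of
the examinees and the highest possible rest score is empty,
which mirrors the sparse upper groups expected under extreme
positive skew.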

Conclusion  2 

The FIFA procedure exhibited more power than the other
two for the lognormal data sets. In fact, the power was
good to excellent for a test length of 50. However,
interpretation of these findings is difficult, if not
impossible, without an examination of Type I error rates for
lognormal data sets. Without knowledge of the Type I error
rates, it is hard to know whether to attribute the high
number of rejections to the robustness of the chi-square
difference to violations of normality, or to inflated
rejection rates even under the null condition. Recall, upon
examination of the normal null condition, Zwick (1987) found
that G² resulted in an overestimation of factors. Therefore,
it is very likely that the G² results in overly high
rejection rates in the presence of negligible departures
from model fit. More empirical work is needed to resolve
this issue.




Conclusions Concerning Normal Distributions

Three additional conclusions can be made regarding the
performance of the procedures under normal distribution
conditions: (a) On average, the FIFA procedure outperformed
the other two procedures for all combinations of sample
size, test length, and inter-trait correlations (Conclusion
3); (b) Adequate power for both the Stout T procedure and
the Holland-Rosenbaum procedure is limited to long tests
(N = 50) with a low inter-trait correlation (r = .3)
(Conclusion 4); and (c) The power of the Stout T procedure
was usually higher than the power of the Holland-Rosenbaum
procedure for all factor combinations (Conclusion 5).

Conclusion 3

size, test length, and inter-trait correlations. Given the
results of other studies (Nandakumar, 1994; Zwick, 1987),
the uniformly high rejection rates across all factors are

observed  Type  I error  rate  often  exceeded  the  nominal  rate, 
leading  to  overfactorization.  With  this  in  mind,  it  has 
been  recommended  that  the  significance  of  the  chi-square  be 


Conclusion 4


Both the Stout T procedure and the Holland-Rosenbaum
procedure exhibited fair to excellent power for a test
length of 50 items and an inter-trait correlation of .3.
The Holland-Rosenbaum procedure performed as was expected;
that is, others found that the procedure works best for
large samples, long tests (Ben-Simon & Cohen, 1990) and low
inter-trait correlations (Nandakumar, 1994).

In the present study, however, the performance of the
Stout T procedure was slightly discrepant from expectations.
For instance, Nandakumar and Stout (1993, p. 59) reported
"the power is very high for p = .7 for 2,000 examinees, and
the power is very good for p = .7 for 750 examinees." In
the present study, the rejection rates for an inter-trait
correlation of .7 were low across sample sizes and test
lengths.

It  is  unlikely  that  differences  in  sample  size,  alone, 
account  for  the  aforementioned  discrepancy.  Nandakumar  and 
Stout  (1993)  and  Stout  (1987)  generated  item  responses  so 
that  one  third  of  the  items  loaded  strictly  on  the  first 
trait,  one  third  loaded  strictly  on  the  second  trait,  and 
the  remaining  third  loaded  on  both  dimensions.  The 
composite  reference  vector,  as  defined  by  Wang  (1986),  would 

from the reference composite. In the present study,
one fifth of the items loaded strictly on the first trait;
the remaining four-fifths loaded on both traits. This
remaining four-fifths was divided evenly among the selected
item directions of 15, 30, 45, and 60 degrees. As a result,
the dependency structures of the Nandakumar and Stout (1993)
data sets exhibit stronger multidimensionality than the
dependency structures generated for this study.
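
The item layout just described can be sketched in a few lines.
This is an illustration under stated assumptions, not the
generating program used in the study: it assumes a
two-dimensional compensatory model, that an item direction is
the angle measured from the first trait axis, that items
loading strictly on the first trait have a direction of 0
degrees, and a common discrimination magnitude (`mdisc`, a
hypothetical value of 1.0).

```python
import math

def discrimination_vector(angle_deg, mdisc=1.0):
    """Return (a1, a2) for an item pointing at angle_deg from trait 1."""
    theta = math.radians(angle_deg)
    return (mdisc * math.cos(theta), mdisc * math.sin(theta))

def item_directions(n_items):
    """One fifth of items at 0 degrees; the rest split evenly
    among 15, 30, 45, and 60 degrees."""
    per_group = n_items // 5
    angles = [0] * per_group
    for angle in (15, 30, 45, 60):
        angles += [angle] * per_group
    return angles

angles = item_directions(25)
vectors = [discrimination_vector(a) for a in angles]
```

Under these assumptions every item retains a nonzero loading on
the first trait, and none points along the second trait alone,
which is one way to see why this structure is less strongly
multidimensional than one with a block of items loading only on
the second trait.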


For most factor combinations the power of the Stout T
procedure was usually higher than the power of the
Holland-Rosenbaum procedure. Findings from earlier studies
have indicated that the Holland-Rosenbaum procedure has a
conservative observed Type I error rate (Nandakumar, 1994;
Zwick, 1987) and exhibits less power than the Stout T
procedure (Nandakumar, 1994). However, neither of these
earlier studies had included replications, which limited the
generalizability and interpretability of their findings.

Directions for Future Research
Results  of  the  present  study  indicate  two  major 

research is needed to strengthen the generalizability of
investigations of the Stout T procedure, the
Holland-Rosenbaum procedure, and the FIFA procedure.
Secondly, the focus of additional research should be on
finding alternative ways of implementing each of the three
procedures to enhance performance. Each recommendation is
addressed in the following paragraphs.

Recommendation 1. The generalizability of the findings
is confined to conditions investigated in the present study.
First, the inclusion of additional levels of sample size,

enhance the generalizability of the study. For instance,
it is suspected the power of the Stout T procedure and the
Holland-Rosenbaum procedure would improve for both larger
sample sizes (J = 2,000, 2,500) and longer tests (N = 75).
Furthermore, it would be of interest to examine the

sampled from a moderately skewed distribution. It is
probable that the inclusion of a moderately skewed
distribution would still result in a drop in power, although

skewed distribution in the present study.

dependency  structures  would  strengthen  the  findings  of  the 

include  two-dimensional  dependency  structures  similar  to  the 
ones  generated  by  Stout  (1987)  and  Nandakumar  and  Stout 
(1993). In this case, it is likely the Holland-Rosenbaum
procedure  would  exhibit  more  power  and  be  affected  to  a 


The most serious limitation of the present study is the
exclusion of the null condition, particularly for the
lognormal condition. Although Zwick (1987) and Nandakumar
(1991) report the Type I error rates for the FIFA procedure,
it remains an open question whether acceptable error
rates will result under the lognormal condition. If so, the
FIFA chi-square may be useful for detecting departures from
unidimensionality for skewed data. In fact, knowledge of
the Type I error rates of each of the procedures under
varying latent distributions would enable comparisons to be

Recommendation  2.  Improvements  to  enhance  the 
performance  of  the  Stout  T procedure,  the  Holland-Rosenbaum 
procedure,  and  the  FIFA  procedure  should  be  considered  for 

follows. 

First of all, the Stout T procedure should exhibit
greater power if a distribution-free factor analytic
procedure is used for the selection of items for the
assessment subtest. Nandakumar and Stout (1993) reported
investigating the application of a two-factor quadratic
nonlinear factor analysis with a guessing correction.
Although no results are formally reported, they claimed the
application of the nonlinear factor analysis resulted in no
differences in T values. However, their investigation was


APPENDIX 


... at the .01 Significance Level for Combinations of Sample
Size (J), ...

                    N=25                N=50
                 Correlation         Correlation
 DT      J       .3       .7         .3       .7

 N      500       2        1         59        8
       1000      22        0         80        0
       1500      35        0         96        6

 L      500       0        0          0        1
       1500       1        0          0        1


Table 19

Percent Rejection Rates for ... at the .01 Significance Level
for Combinations of Sample Size (J), Test Length (N),
Inter-trait Correlations, and Distribution (DT, N=Normal and
L=Lognormal)

Table 20

Percent Rejection Rates for Holland-Rosenbaum Procedure at
the .01 Significance Level for Combinations of Sample Size
(J), Test Length (N), Inter-trait Correlations, and
Distribution (DT, N=Normal and L=Lognormal)


Number (Percent) of Item Pairs Rejected at the .01
Significance Level for the Holland-Rosenbaum Procedure for
Combinations of Sample Size (J), Test Length (N), Inter-trait
Correlations, and Distribution (DT, N=Normal and L=Lognormal)

 DT      J       N=25      N=50 (.3)    N=50 (.7)

 N      500      1.35       12.03         4.13
                (0.5)       (1.0)        (0.3)
       1000      1.24       22.07         4.21
                (0.4)       (1.8)        (0.3)
       1500      1.11       34.58         4.50
                (0.4)       (2.8)        (0.4)

 L      500      1.68        2.80         2.95
                (0.6)       (0.2)        (0.2)
       1000      1.52        2.55         2.49
                (0.5)       (0.2)        (0.2)
       1500      2.00        2.67         2.32
                (0.7)       (0.2)        (0.2)

Note. For N=25 there were 300 unique item pairs; for N=50
there were 1,225 unique item pairs. Percents rounded to the
nearest 1/10.




Percent Rejection Rates for the FIFA Chi-Square Difference
Statistic at the .01 Significance Level for Combinations of
Sample Size (J), Test Length (N), Inter-trait Correlations,
and Distribution (DT, N=Normal and L=Lognormal)


500  96 


500  20 

1000  34 

1500  3S 


17 


100 


100 

67 

93 

97 


REFERENCES


Ackerman, T. A. (1988, April). An explanation of differential
item functioning from a multidimensional perspective.
Paper presented at the annual meeting of the American
Educational Research Association, New Orleans.

Ackerman, T. A. (1989). Unidimensional IRT calibration of
compensatory and noncompensatory multidimensional
items. Applied Psychological Measurement, 13, 113-127.

Ackerman, T. A. (1990, April). An evaluation of the
multidimensional parallelism of the EAAP mathematics
test. Paper presented at the annual meeting of the
American Educational Research Association, Boston.

Agresti, A. (1990). Categorical data analysis. New York:
John Wiley & Sons.

Algina, J., & Oshima, T. C. (in press). Type I error rates
for Huynh's general approximation and improved general
approximation tests. British Journal of Mathematical
and Statistical Psychology.

Algina, J., Oshima, T. C., & Tang, K. L. (1991). Robustness
of Yao's, James', and Johansen's tests under variance-
covariance heteroscedasticity and non-normality.
Journal of Educational Statistics, 13, 281-290.

Andersson, C. G., Christoffersson, A., & Muthen, B. (1974).
FADIV: A computer program for factor analysis of
dichotomized variables (Report No. 74-1) [Computer
program]. Uppsala, Sweden: Uppsala University,
Statistics Department.

Bartholomew, D. J. (1980). Factor analysis for categorical
data (with discussion). Journal of the Royal
Statistical Society, Series B, 42, 293-321.

Bartlett, M. S. (1950). Tests of significance in factor
analysis. British Journal of Psychology, 3, 77-85.




tests.  Journal  of  Applied 




Harrison, D. A. (1986). Robustness of IRT parameter
estimation to violations of the unidimensionality
assumption. Journal of Educational Statistics, 11,
91-115.

Hattie, J. (1984). An empirical study of various indices for
determining unidimensionality. Multivariate Behavioral
Research, 19, 49-78.

Hattie, J. (1985). Methodology review: Assessing
unidimensionality of tests and items. Applied
Psychological Measurement, 9, 139-164.

Holland, P. W., & Rosenbaum, P. R. (1986). Conditional
association and unidimensionality in monotone latent
variable models. Annals of Statistics, 14, 1523-1543.

Horn, J. L. (1965). A rationale and test for the number of
factors in factor analysis. Psychometrika, 30, 179-185.

Hulin, C., Drasgow, F., & Parsons, C. (1983). Item response
theory: Application to psychological measurement.
Homewood, IL: Dow Jones-Irwin.

Hulin, C., Lissak, R. I., & Drasgow, F. (1982). Recovery of
two- and three-parameter logistic item characteristic
curves: A monte carlo study. Applied Psychological
Measurement, 6, 249-260.

Humphreys, L. C. (1985). General intelligence: An
integration of factor, test, and simplex theory. In B.
J. Wolman (Ed.), Handbook of intelligence: Theories,
measurements, and applications (pp. 201-224). New
York: John Wiley and Sons.

Johnson, M. F., Ramberg, J. S., & Wang, C. (1982). The
Johnson translation system in Monte Carlo studies.
Communications in Statistics-Simulation and
Computation, 11, 521-525.

Kingston, N., & McKinley, R. (1983, April). Assessing the
structure of the GRE general test using confirmatory
multidimensional item response theory. Paper presented
at the annual meeting of the American Educational
Research Association, New Orleans.

Lawley, D. N. (1943). On problems connected with item
selection and test construction. Proceedings of the




Lord, F. M. (1952). Theory of test scores (Psychometric
Monograph No. 7). Psychometric Society.

Lord, F. M. (1968). An analysis of the Verbal Scholastic
Aptitude Test using Birnbaum's three-parameter logistic
model. Educational and Psychological Measurement, 28,
989-1020.

Lord, F. M. (1980). Application of item response theory to
practical testing problems. Reading, MA: Addison-
Wesley.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of
mental test scores. Reading, MA: Addison-Wesley.

Mantel, N., & Haenszel, W. (1959). Statistical aspects of the
retrospective study of disease. Journal of the National

McDonald, R. P. (1962). A general approach to nonlinear
factor analysis. Psychometrika, 27, 397-415.

McDonald, R. P. (1967). Nonlinear factor analysis.
Psychometric Monograph No. 15, 32(4, Pt. 2).

McDonald, R. P. (1979). The structural analysis of
multivariate data: A sketch of a general theory.
Multivariate Behavioral Research, 14, 21-38.

McDonald, R. P. (1981). The dimensionality of test and
items. British Journal of Mathematical and Statistical
Psychology, 34, 100-117.


<.  S.  (1974).  Difficulty  factors 
Journal  of  Mathematical  and 


McKinley, R. L., & Reckase, M. D. (1983). MAXLOG: A
computer program for the estimation of parameters of a
multidimensional logistic model. Behavior Research
Methods and Instrumentation, 15, 389-390.

Miller, M. D., Oshima, T. C., & Seraphine, A. E. (1993,
April). Estimation of multidimensional IRT. Paper
presented at the annual meeting of the American
Educational Research Association, Atlanta, GA.

Mislevy, R. J. (1984). Estimating latent distributions.
Psychometrika, 49, 359-381.

analysis of categorical variables. Journal of
Educational Statistics, 11, 3-31.

Mislevy, R. J., & Bock, R. D. (1984). Item operating
characteristics of the Armed Services Aptitude Battery
(ASVAB), Form 8A (Tech. Rep. N00014-83-C-0283).
Washington, DC: Office of Naval Research.

Muraki, E., & Engelhard, G. (1985). Full-information factor
analysis: Applications of EAP scores. Applied

BIOGRAPHICAL  SKETCH 


I certify that I have read this study and that in my
opinion  it  conforms  to  acceptable  standards  of  scholarly 
presentation  and  is  fully  adequate,  in  scope  and  quality,  as 


Counselor  Education 


This dissertation was submitted to the Graduate Faculty
of the College of Education and to the Graduate School and
was accepted as partial fulfillment of the requirements for
the degree of Doctor of Philosophy.


aduate  School 


